doc: Plan test harness with introspection capabilities (IT-05)

scawful
2025-10-02 15:00:34 -04:00
parent fdead0e9e5
commit 3a573c0764
5 changed files with 1552 additions and 118 deletions


@@ -24,7 +24,12 @@ This document is the **source of truth** for the z3ed CLI architecture and desig
- **Proposal Registry**: Cross-session proposal tracking with disk persistence
**🔄 In Progress**:
- **E2E Validation**: Testing the complete workflow (80% done; window detection fix pending)
- **Test Harness Enhancements (IT-05 to IT-09)**: Expanding from basic automation to a comprehensive testing platform
  - Test introspection APIs for status/results polling
  - Widget discovery for AI-driven interactions
  - Test recording/replay for regression testing
  - Enhanced error reporting with screenshots
  - CI/CD integration with standardized test formats
**📋 Planned Next**:
- **Policy Evaluation Framework (AW-04)**: YAML-based constraints for proposal acceptance
@@ -51,6 +56,13 @@ The z3ed CLI is built on three core pillars:
**gRPC Test Harness**: Embedded gRPC server in YAZE enables remote GUI automation for testing and AI-driven workflows.
**Comprehensive Testing Platform**: The test harness has evolved beyond basic automation to support:
- **Widget Discovery**: AI agents can enumerate available GUI interactions dynamically
- **Test Introspection**: Query test status, results, and execution queue in real-time
- **Recording & Replay**: Capture test sessions as JSON scripts for regression testing
- **CI/CD Integration**: Standardized test suite format with JUnit XML output
- **Enhanced Debugging**: Screenshot capture, widget state dumps, and execution context on failures
**Cross-Platform Foundation**: Core built for macOS/Linux with Windows support planned via vcpkg.
## 3. Proposed CLI Architecture: Resource-Oriented Commands
@@ -203,6 +215,257 @@ The `z3ed agent` command is the main entry point for the agent. It has the follo
### 8.3. AI Model & Protocol Strategy
- **Models**: The framework will support both local and remote AI models, offering flexibility and catering to different user needs.
---
## 9. Test Harness Evolution: From Automation to Platform
The ImGuiTestHarness has evolved from a basic GUI automation tool into a comprehensive testing platform that serves dual purposes: **AI-driven generative workflows** and **traditional GUI testing**.
### 9.1. Current Capabilities (IT-01 to IT-04) ✅
**Core Automation** (6 RPCs; see the client sketch after this list):
- `Ping` - Health check and version verification
- `Click` - Button, menu, and tab interactions
- `Type` - Text input with focus management
- `Wait` - Condition polling (window visibility, element state)
- `Assert` - State validation (visible, enabled, exists)
- `Screenshot` - Screen capture (stub; implementation pending)
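For orientation, here is a minimal Python sketch of driving these RPCs from a client. The generated modules (`test_harness_pb2`, `test_harness_pb2_grpc`), the stub name, the request fields, and the port are assumptions for illustration, not the shipped proto:
```python
import grpc

# Hypothetical generated modules; the actual proto names may differ.
import test_harness_pb2 as pb
import test_harness_pb2_grpc as pb_grpc

# The port is an assumption; use whatever the harness actually binds.
channel = grpc.insecure_channel("localhost:50051")
stub = pb_grpc.ImGuiTestHarnessStub(channel)

# Health check, then a simple click-and-wait workflow.
print(stub.Ping(pb.PingRequest()))
stub.Click(pb.ClickRequest(target="menuitem:Overworld Editor"))
stub.Wait(pb.WaitRequest(condition="window_visible:Overworld", timeout_ms=5000))
stub.Assert(pb.AssertRequest(condition="visible:Overworld"))
```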
**Integration Points**:
- ImGuiTestEngine dynamic test registration
- Async test queue with frame-accurate timing
- gRPC server embedded in YAZE process
- Cross-platform build (macOS validated, Windows planned)
**Proven Use Cases**:
- Menu-driven editor opening (Overworld, Dungeon, etc.)
- Window visibility validation
- Multi-step workflows with timing dependencies
- Natural language test prompts via `z3ed agent test`
### 9.2. Limitations Identified
**For AI Agents**:
- ❌ Can't discover available widgets → must hardcode target names
- ❌ No way to query test results → async tests return immediately with no status
- ❌ No structured error context → failures lack screenshots and state dumps
- ❌ Limited to predefined actions → can't learn new interaction patterns
**For Traditional Testing**:
- ❌ No test recording → can't capture manual workflows for regression
- ❌ No test suite format → can't organize tests into smoke/regression/nightly groups
- ❌ No CI integration → can't run tests in automated pipelines
- ❌ No result persistence → test history lost between sessions
- ❌ Poor debugging → failures don't capture visual or state context
### 9.3. Enhancement Roadmap (IT-05 to IT-09)
#### IT-05: Test Introspection API (6-8 hours)
**Problem**: Tests execute asynchronously with no way to query status or results. Clients poll blindly or give up early.
**Solution**: Add 3 new RPCs (a client polling sketch follows the list):
- `GetTestStatus(test_id)` → Returns queued/running/passed/failed/timeout with execution time
- `ListTests(category_filter)` → Enumerates all registered tests with metadata
- `GetTestResults(test_id)` → Retrieves detailed results: logs, assertions, metrics
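At the RPC level, a client could poll for completion as in this sketch, reusing the hypothetical stub and message names from the sketch in §9.1 (the string status values mirror the CLI example below):
```python
import time

def wait_for_test(stub, pb, test_id, timeout_s=30.0):
    """Poll GetTestStatus until the test reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = stub.GetTestStatus(pb.GetTestStatusRequest(test_id=test_id))
        # `state` as a string enum name is an assumption for illustration.
        if status.state in ("PASSED", "FAILED", "TIMEOUT"):
            return stub.GetTestResults(pb.GetTestResultsRequest(test_id=test_id))
        time.sleep(0.5)
    raise TimeoutError(f"test {test_id} did not finish within {timeout_s}s")
```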
**Benefits**:
- AI agents can poll for test completion reliably
- CLI can show real-time progress bars
- Test history enables trend analysis (flaky tests, performance regressions)
**Example Flow**:
```bash
# Queue test (returns immediately with test_id)
TEST_ID=$(z3ed agent test --prompt "Open Overworld" --output json | jq -r '.test_id')
# Poll until complete
while true; do
  STATUS=$(z3ed agent test status --test-id "$TEST_ID" --output json | jq -r '.status')
  [[ "$STATUS" =~ ^(PASSED|FAILED|TIMEOUT)$ ]] && break
  sleep 0.5
done
# Get results
z3ed agent test results --test-id $TEST_ID --include-logs
```
#### IT-06: Widget Discovery API (4-6 hours)
**Problem**: AI agents must know widget names in advance. Can't adapt to UI changes or learn new editors.
**Solution**: Add `DiscoverWidgets` RPC:
- Enumerates all windows currently open
- Lists interactive widgets per window: buttons, inputs, menus, tabs
- Returns metadata: ID, label, type, enabled state, position
- Provides suggested action templates (e.g., "Click button:Save")
**Benefits**:
- AI agents discover GUI capabilities dynamically
- Test scripts validate expected widgets exist
- LLM prompts improved with natural language descriptions
- Reduces brittleness from hardcoded widget names
**Example Flow**:
```python
# AI agent workflow (sketch; `z3ed_client` is a connected gRPC stub)
widgets = z3ed_client.DiscoverWidgets(window_filter="Overworld")

# LLM prompt: "Which buttons are available in the Overworld editor?"
available_actions = [w.suggested_action for w in widgets.buttons if w.is_enabled]

# LLM generates: "Click button:Save Changes"
z3ed_client.Click(target="button:Save Changes")
```
#### IT-07: Test Recording & Replay (8-10 hours)
**Problem**: No way to capture manual workflows for regression. Testers repeat the same actions every release.
**Solution**: Add recording workflow:
- `StartRecording(output_file)` → Begins capturing all RPC calls
- `StopRecording()` → Saves to JSON test script
- `ReplayTest(test_script)` → Executes recorded actions with validation
**Test Script Format** (JSON):
```json
{
  "name": "Overworld Tile Edit Test",
  "steps": [
    { "action": "Click", "target": "menuitem:Overworld Editor" },
    { "action": "Wait", "condition": "window_visible:Overworld", "timeout_ms": 5000 },
    { "action": "Click", "target": "button:Select Tile" },
    { "action": "Assert", "condition": "enabled:button:Apply" }
  ]
}
```
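Given this format, the replay engine reduces to a dispatch loop. A minimal sketch, assuming the hypothetical gRPC stub and message names from §9.1:
```python
import json

def replay(stub, pb, script_path):
    """Replay a recorded JSON test script against the harness."""
    with open(script_path) as f:
        script = json.load(f)
    for step in script["steps"]:
        action = step["action"]
        if action == "Click":
            stub.Click(pb.ClickRequest(target=step["target"]))
        elif action == "Wait":
            stub.Wait(pb.WaitRequest(condition=step["condition"],
                                     timeout_ms=step.get("timeout_ms", 5000)))
        elif action == "Assert":
            stub.Assert(pb.AssertRequest(condition=step["condition"]))
        else:
            raise ValueError(f"unknown action: {action}")
```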
**Benefits**:
- QA engineers record test scenarios once, replay forever
- Test scripts version controlled alongside code
- Parameterized tests (e.g., test with different ROMs)
- Foundation for test suite management (smoke, regression, nightly)
#### IT-08: Enhanced Error Reporting (3-4 hours)
**Problem**: Test failures lack context. A developer sees "Window not visible" but doesn't know why.
**Solution**: Capture rich context on failure:
- Screenshot (implement stub RPC)
- Widget state dump (full hierarchy with properties)
- Execution context (active window, recent events, resource stats)
- HTML report generation with annotated screenshots
**Example Error Report**:
```json
{
  "test_id": "grpc_wait_12345678",
  "failure_reason": "Timeout waiting for window_visible:Overworld",
  "screenshot": "test-results/failure_12345678.png",
  "widget_state": {
    "visible_windows": ["Main Window", "Debug"],
    "overworld_window": { "exists": true, "visible": false, "reason": "not_initialized" }
  },
  "execution_context": {
    "last_click": "menuitem:Overworld Editor",
    "frames_since_click": 150,
    "resource_stats": { "memory_mb": 245, "framerate": 58.3 }
  }
}
```
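The HTML report mentioned above can be rendered straight from this JSON. A minimal sketch, not the planned generator:
```python
import html
import json

def render_report(report_path, out_path):
    """Render a failure-report JSON file as a simple HTML page."""
    with open(report_path) as f:
        r = json.load(f)
    doc = (
        f"<h1>Test {html.escape(r['test_id'])}</h1>"
        f"<p><b>Failure:</b> {html.escape(r['failure_reason'])}</p>"
        f"<img src=\"{html.escape(r['screenshot'])}\" alt=\"failure screenshot\">"
        f"<pre>{html.escape(json.dumps(r['widget_state'], indent=2))}</pre>"
    )
    with open(out_path, "w") as f:
        f.write(doc)
```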
**Benefits**:
- Developers fix failing tests faster (visual + state context)
- Flaky test debugging (see exact UI state at failure)
- Test reports shareable with QA/PM (HTML with screenshots)
#### IT-09: CI/CD Integration (2-3 hours)
**Problem**: Tests run manually. No automated regression on PR/merge.
**Solution**: Standardize test execution for CI:
- YAML test suite format (groups, dependencies, parallel execution; see the sketch after this list)
- `z3ed test suite run` command with `--ci-mode`
- JUnit XML output for CI parsers (Jenkins, GitHub Actions)
- Exit codes: 0=pass, 1=fail, 2=error
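The suite format itself is not yet specified; a plausible shape for `tests/smoke.yaml` might be (all field names are illustrative assumptions):
```yaml
# Hypothetical suite format; field names are illustrative, not final.
name: smoke
parallel: true
tests:
  - name: open_overworld
    script: tests/scripts/open_overworld.json
    timeout_ms: 10000
  - name: open_dungeon
    script: tests/scripts/open_dungeon.json
    depends_on: [open_overworld]
```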
**GitHub Actions Example**:
```yaml
name: GUI Tests
on: [push, pull_request]
jobs:
  gui-tests:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build YAZE
        run: |
          cmake -B build
          cmake --build build --target yaze --target z3ed
      - name: Start test harness
        run: ./build/bin/yaze --enable_test_harness --headless &
      - name: Run smoke tests
        run: ./build/bin/z3ed test suite run tests/smoke.yaml --ci-mode
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: test-results
          path: test-results/
```
**Benefits**:
- Catch regressions before merge
- Test history tracked in CI dashboard
- Parallel execution for faster feedback
- Flaky test detection (retry logic, failure rates)
### 9.4. Unified Testing Vision
The enhanced test harness serves three audiences:
**For AI Agents** (Generative Workflows):
- Widget discovery enables dynamic learning
- Test introspection provides reliable feedback loops
- Recording captures expert workflows for training data
**For Developers** (Unit/Integration Testing):
- Test suites organize tests by scope (smoke, regression, nightly)
- CI integration catches regressions early
- Rich error reporting speeds up debugging
**For QA Engineers** (Manual Testing Automation):
- Record manual workflows once, replay forever
- Parameterized tests reduce maintenance burden
- Visual test reports simplify communication
**Shared Infrastructure**:
- Single gRPC server handles all test types
- Consistent test script format (JSON/YAML)
- Common result storage and reporting
- Cross-platform support (macOS, Windows, Linux)
### 9.5. Implementation Priority
**Phase 1: Foundation** (Already Complete ✅)
- Core automation RPCs (Ping, Click, Type, Wait, Assert)
- ImGuiTestEngine integration
- gRPC server lifecycle
- Basic E2E validation
**Phase 2: Introspection & Discovery** (IT-05, IT-06 - 10-14 hours)
- Test status/results querying
- Widget enumeration API
- Async test management
- *Critical for AI agents*
**Phase 3: Recording & Replay** (IT-07 - 8-10 hours)
- Test script format
- Recording workflow
- Replay engine
- *Unlocks regression testing*
**Phase 4: Production Readiness** (IT-08, IT-09 - 5-7 hours)
- Screenshot implementation
- Error context capture
- CI/CD integration
- *Enables automated pipelines*
**Total Estimated Effort**: 23-31 hours beyond current implementation
---
- **Local Models (macOS Setup)**: For privacy, offline use, and reduced operational costs, integration with local LLMs via [Ollama](https://ollama.ai/) is a priority. Users can easily install Ollama on macOS and pull models optimized for code generation, such as `codellama:7b`. The `z3ed` agent will communicate with Ollama's local API endpoint.
- **Remote Models (Gemini API)**: For more complex tasks requiring advanced reasoning capabilities, integration with powerful remote models like the Gemini API will be available. Users will need to provide a `GEMINI_API_KEY` environment variable. A new `GeminiAIService` class will be implemented to handle the secure API requests and responses.
- **Protocol**: A robust, yet simple, JSON-based protocol will be used for communication between `z3ed` and the AI model. This ensures structured data exchange, critical for reliable parsing and execution. The `z3ed` tool will serialize the user's prompt, current ROM context, available `z3ed` commands, and any relevant `ImGuiTestEngine` capabilities into a JSON object. The AI model will be expected to return a JSON object containing the sequence of commands to be executed, along with potential explanations or confidence scores.
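As a concrete illustration, one exchange under this protocol might look as follows (the field names and example commands are illustrative assumptions; the schema is not yet fixed). Request from `z3ed` to the model:
```json
{
  "prompt": "Apply the palette patch, then verify the Overworld editor opens",
  "rom_context": { "title": "THE LEGEND OF ZELDA", "size_kb": 1024 },
  "available_commands": ["rom patch", "agent test", "agent run"],
  "test_harness_capabilities": ["Click", "Type", "Wait", "Assert"]
}
```
Response from the model:
```json
{
  "commands": [
    { "command": "rom patch --bps palette.bps" },
    { "command": "agent test --prompt \"Open Overworld editor\"" }
  ],
  "explanation": "Apply the patch, then confirm the editor opens via GUI automation.",
  "confidence": 0.85
}
```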