doc: Plan test harness with introspection capabilities (IT-05)

scawful
2025-10-02 15:00:34 -04:00
parent fdead0e9e5
commit 3a573c0764
5 changed files with 1552 additions and 118 deletions


@@ -24,7 +24,12 @@ This document is the **source of truth** for the z3ed CLI architecture and desig
- **Proposal Registry**: Cross-session proposal tracking with disk persistence
**🔄 In Progress**:
- **E2E Validation**: Testing the complete workflow (80% done; window detection fix pending)
- **Test Harness Enhancements (IT-05 to IT-09)**: Expanding from basic automation to a comprehensive testing platform
  - Test introspection APIs for status/results polling
  - Widget discovery for AI-driven interactions
  - Test recording/replay for regression testing
  - Enhanced error reporting with screenshots
  - CI/CD integration with standardized test formats
**📋 Planned Next**:
- **Policy Evaluation Framework (AW-04)**: YAML-based constraints for proposal acceptance
@@ -51,6 +56,13 @@ The z3ed CLI is built on three core pillars:
**gRPC Test Harness**: Embedded gRPC server in YAZE enables remote GUI automation for testing and AI-driven workflows.
**Comprehensive Testing Platform**: The test harness has evolved beyond basic automation to support:
- **Widget Discovery**: AI agents can enumerate available GUI interactions dynamically
- **Test Introspection**: Query test status, results, and execution queue in real-time
- **Recording & Replay**: Capture test sessions as JSON scripts for regression testing
- **CI/CD Integration**: Standardized test suite format with JUnit XML output
- **Enhanced Debugging**: Screenshot capture, widget state dumps, and execution context on failures
**Cross-Platform Foundation**: Core built for macOS/Linux with Windows support planned via vcpkg.
## 3. Proposed CLI Architecture: Resource-Oriented Commands
@@ -203,6 +215,257 @@ The `z3ed agent` command is the main entry point for the agent. It has the follo
### 8.3. AI Model & Protocol Strategy
- **Models**: The framework will support both local and remote AI models, offering flexibility and catering to different user needs.
---
## 9. Test Harness Evolution: From Automation to Platform
The ImGuiTestHarness has evolved from a basic GUI automation tool into a comprehensive testing platform that serves dual purposes: **AI-driven generative workflows** and **traditional GUI testing**.
### 9.1. Current Capabilities (IT-01 to IT-04) ✅
**Core Automation** (6 RPCs; see the client sketch after this list):
- `Ping` - Health check and version verification
- `Click` - Button, menu, and tab interactions
- `Type` - Text input with focus management
- `Wait` - Condition polling (window visibility, element state)
- `Assert` - State validation (visible, enabled, exists)
- `Screenshot` - Screen capture (stub; implementation pending)
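For orientation, here is a minimal Python sketch of driving these RPCs from a client. The generated modules (`test_harness_pb2`, `test_harness_pb2_grpc`), the stub name, the request fields, and the port are assumptions for illustration, not the shipped proto:
```python
import grpc

# Hypothetical generated modules; the actual proto names may differ.
import test_harness_pb2 as pb
import test_harness_pb2_grpc as pb_grpc

# The port is an assumption; use whatever the harness actually binds.
channel = grpc.insecure_channel("localhost:50051")
stub = pb_grpc.ImGuiTestHarnessStub(channel)

# Health check, then a simple click-and-wait workflow.
print(stub.Ping(pb.PingRequest()))
stub.Click(pb.ClickRequest(target="menuitem:Overworld Editor"))
stub.Wait(pb.WaitRequest(condition="window_visible:Overworld", timeout_ms=5000))
stub.Assert(pb.AssertRequest(condition="visible:Overworld"))
```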
**Integration Points**:
- ImGuiTestEngine dynamic test registration
- Async test queue with frame-accurate timing
- gRPC server embedded in YAZE process
- Cross-platform build (macOS validated, Windows planned)
**Proven Use Cases**:
- Menu-driven editor opening (Overworld, Dungeon, etc.)
- Window visibility validation
- Multi-step workflows with timing dependencies
- Natural language test prompts via `z3ed agent test`
### 9.2. Limitations Identified
**For AI Agents**:
- ❌ Can't discover available widgets → must hardcode target names
- ❌ No way to query test results → async tests return immediately with no status
- ❌ No structured error context → failures lack screenshots and state dumps
- ❌ Limited to predefined actions → can't learn new interaction patterns
**For Traditional Testing**:
- ❌ No test recording → can't capture manual workflows for regression
- ❌ No test suite format → can't organize tests into smoke/regression/nightly groups
- ❌ No CI integration → can't run tests in automated pipelines
- ❌ No result persistence → test history lost between sessions
- ❌ Poor debugging → failures don't capture visual or state context
### 9.3. Enhancement Roadmap (IT-05 to IT-09)
#### IT-05: Test Introspection API (6-8 hours)
**Problem**: Tests execute asynchronously with no way to query status or results. Clients poll blindly or give up early.
**Solution**: Add 3 new RPCs (a client polling sketch follows the list):
- `GetTestStatus(test_id)` → Returns queued/running/passed/failed/timeout with execution time
- `ListTests(category_filter)` → Enumerates all registered tests with metadata
- `GetTestResults(test_id)` → Retrieves detailed results: logs, assertions, metrics
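At the RPC level, a client could poll for completion as in this sketch, reusing the hypothetical stub and message names from the sketch in §9.1 (the string status values mirror the CLI example below):
```python
import time

def wait_for_test(stub, pb, test_id, timeout_s=30.0):
    """Poll GetTestStatus until the test reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = stub.GetTestStatus(pb.GetTestStatusRequest(test_id=test_id))
        # `state` as a string enum name is an assumption for illustration.
        if status.state in ("PASSED", "FAILED", "TIMEOUT"):
            return stub.GetTestResults(pb.GetTestResultsRequest(test_id=test_id))
        time.sleep(0.5)
    raise TimeoutError(f"test {test_id} did not finish within {timeout_s}s")
```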
**Benefits**:
- AI agents can poll for test completion reliably
- CLI can show real-time progress bars
- Test history enables trend analysis (flaky tests, performance regressions)
**Example Flow**:
```bash
# Queue test (returns immediately with test_id)
TEST_ID=$(z3ed agent test --prompt "Open Overworld" --output json | jq -r '.test_id')
# Poll until complete
while true; do
  STATUS=$(z3ed agent test status --test-id "$TEST_ID" --output json | jq -r '.status')
  [[ "$STATUS" =~ ^(PASSED|FAILED|TIMEOUT)$ ]] && break
  sleep 0.5
done
# Get results
z3ed agent test results --test-id $TEST_ID --include-logs
```
#### IT-06: Widget Discovery API (4-6 hours)
**Problem**: AI agents must know widget names in advance. Can't adapt to UI changes or learn new editors.
**Solution**: Add `DiscoverWidgets` RPC:
- Enumerates all windows currently open
- Lists interactive widgets per window: buttons, inputs, menus, tabs
- Returns metadata: ID, label, type, enabled state, position
- Provides suggested action templates (e.g., "Click button:Save")
**Benefits**:
- AI agents discover GUI capabilities dynamically
- Test scripts validate expected widgets exist
- LLM prompts improved with natural language descriptions
- Reduces brittleness from hardcoded widget names
**Example Flow**:
```python
# AI agent workflow (sketch; `z3ed_client` is a connected gRPC stub)
widgets = z3ed_client.DiscoverWidgets(window_filter="Overworld")

# LLM prompt: "Which buttons are available in the Overworld editor?"
available_actions = [w.suggested_action for w in widgets.buttons if w.is_enabled]

# LLM generates: "Click button:Save Changes"
z3ed_client.Click(target="button:Save Changes")
```
#### IT-07: Test Recording & Replay (8-10 hours)
**Problem**: No way to capture manual workflows for regression. Testers repeat the same actions every release.
**Solution**: Add recording workflow:
- `StartRecording(output_file)` → Begins capturing all RPC calls
- `StopRecording()` → Saves to JSON test script
- `ReplayTest(test_script)` → Executes recorded actions with validation
**Test Script Format** (JSON):
```json
{
  "name": "Overworld Tile Edit Test",
  "steps": [
    { "action": "Click", "target": "menuitem:Overworld Editor" },
    { "action": "Wait", "condition": "window_visible:Overworld", "timeout_ms": 5000 },
    { "action": "Click", "target": "button:Select Tile" },
    { "action": "Assert", "condition": "enabled:button:Apply" }
  ]
}
```
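Given this format, the replay engine reduces to a dispatch loop. A minimal sketch, assuming the hypothetical gRPC stub and message names from §9.1:
```python
import json

def replay(stub, pb, script_path):
    """Replay a recorded JSON test script against the harness."""
    with open(script_path) as f:
        script = json.load(f)
    for step in script["steps"]:
        action = step["action"]
        if action == "Click":
            stub.Click(pb.ClickRequest(target=step["target"]))
        elif action == "Wait":
            stub.Wait(pb.WaitRequest(condition=step["condition"],
                                     timeout_ms=step.get("timeout_ms", 5000)))
        elif action == "Assert":
            stub.Assert(pb.AssertRequest(condition=step["condition"]))
        else:
            raise ValueError(f"unknown action: {action}")
```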
**Benefits**:
- QA engineers record test scenarios once, replay forever
- Test scripts version controlled alongside code
- Parameterized tests (e.g., test with different ROMs)
- Foundation for test suite management (smoke, regression, nightly)
#### IT-08: Enhanced Error Reporting (3-4 hours)
**Problem**: Test failures lack context. A developer sees "Window not visible" but doesn't know why.
**Solution**: Capture rich context on failure:
- Screenshot (implement stub RPC)
- Widget state dump (full hierarchy with properties)
- Execution context (active window, recent events, resource stats)
- HTML report generation with annotated screenshots
**Example Error Report**:
```json
{
  "test_id": "grpc_wait_12345678",
  "failure_reason": "Timeout waiting for window_visible:Overworld",
  "screenshot": "test-results/failure_12345678.png",
  "widget_state": {
    "visible_windows": ["Main Window", "Debug"],
    "overworld_window": { "exists": true, "visible": false, "reason": "not_initialized" }
  },
  "execution_context": {
    "last_click": "menuitem:Overworld Editor",
    "frames_since_click": 150,
    "resource_stats": { "memory_mb": 245, "framerate": 58.3 }
  }
}
```
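The HTML report mentioned above can be rendered straight from this JSON. A minimal sketch, not the planned generator:
```python
import html
import json

def render_report(report_path, out_path):
    """Render a failure-report JSON file as a simple HTML page."""
    with open(report_path) as f:
        r = json.load(f)
    doc = (
        f"<h1>Test {html.escape(r['test_id'])}</h1>"
        f"<p><b>Failure:</b> {html.escape(r['failure_reason'])}</p>"
        f"<img src=\"{html.escape(r['screenshot'])}\" alt=\"failure screenshot\">"
        f"<pre>{html.escape(json.dumps(r['widget_state'], indent=2))}</pre>"
    )
    with open(out_path, "w") as f:
        f.write(doc)
```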
**Benefits**:
- Developers fix failing tests faster (visual + state context)
- Flaky test debugging (see exact UI state at failure)
- Test reports shareable with QA/PM (HTML with screenshots)
#### IT-09: CI/CD Integration (2-3 hours)
**Problem**: Tests run manually. No automated regression on PR/merge.
**Solution**: Standardize test execution for CI:
- YAML test suite format (groups, dependencies, parallel execution; see the sketch after this list)
- `z3ed test suite run` command with `--ci-mode`
- JUnit XML output for CI parsers (Jenkins, GitHub Actions)
- Exit codes: 0=pass, 1=fail, 2=error
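The suite format itself is not yet specified; a plausible shape for `tests/smoke.yaml` might be (all field names are illustrative assumptions):
```yaml
# Hypothetical suite format; field names are illustrative, not final.
name: smoke
parallel: true
tests:
  - name: open_overworld
    script: tests/scripts/open_overworld.json
    timeout_ms: 10000
  - name: open_dungeon
    script: tests/scripts/open_dungeon.json
    depends_on: [open_overworld]
```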
**GitHub Actions Example**:
```yaml
name: GUI Tests
on: [push, pull_request]
jobs:
  gui-tests:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build YAZE
        run: |
          cmake -B build
          cmake --build build --target yaze --target z3ed
      - name: Start test harness
        run: ./build/bin/yaze --enable_test_harness --headless &
      - name: Run smoke tests
        run: ./build/bin/z3ed test suite run tests/smoke.yaml --ci-mode
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: test-results
          path: test-results/
```
**Benefits**:
- Catch regressions before merge
- Test history tracked in CI dashboard
- Parallel execution for faster feedback
- Flaky test detection (retry logic, failure rates)
### 9.4. Unified Testing Vision
The enhanced test harness serves three audiences:
**For AI Agents** (Generative Workflows):
- Widget discovery enables dynamic learning
- Test introspection provides reliable feedback loops
- Recording captures expert workflows for training data
**For Developers** (Unit/Integration Testing):
- Test suites organize tests by scope (smoke, regression, nightly)
- CI integration catches regressions early
- Rich error reporting speeds up debugging
**For QA Engineers** (Manual Testing Automation):
- Record manual workflows once, replay forever
- Parameterized tests reduce maintenance burden
- Visual test reports simplify communication
**Shared Infrastructure**:
- Single gRPC server handles all test types
- Consistent test script format (JSON/YAML)
- Common result storage and reporting
- Cross-platform support (macOS, Windows, Linux)
### 9.5. Implementation Priority
**Phase 1: Foundation** (Already Complete ✅)
- Core automation RPCs (Ping, Click, Type, Wait, Assert)
- ImGuiTestEngine integration
- gRPC server lifecycle
- Basic E2E validation
**Phase 2: Introspection & Discovery** (IT-05, IT-06 - 10-14 hours)
- Test status/results querying
- Widget enumeration API
- Async test management
- *Critical for AI agents*
**Phase 3: Recording & Replay** (IT-07 - 8-10 hours)
- Test script format
- Recording workflow
- Replay engine
- *Unlocks regression testing*
**Phase 4: Production Readiness** (IT-08, IT-09 - 5-7 hours)
- Screenshot implementation
- Error context capture
- CI/CD integration
- *Enables automated pipelines*
**Total Estimated Effort**: 23-31 hours beyond current implementation
---
- **Local Models (macOS Setup)**: For privacy, offline use, and reduced operational costs, integration with local LLMs via [Ollama](https://ollama.ai/) is a priority. Users can easily install Ollama on macOS and pull models optimized for code generation, such as `codellama:7b`. The `z3ed` agent will communicate with Ollama's local API endpoint.
- **Remote Models (Gemini API)**: For more complex tasks requiring advanced reasoning capabilities, integration with powerful remote models like the Gemini API will be available. Users will need to provide a `GEMINI_API_KEY` environment variable. A new `GeminiAIService` class will be implemented to handle the secure API requests and responses.
- **Protocol**: A robust, yet simple, JSON-based protocol will be used for communication between `z3ed` and the AI model. This ensures structured data exchange, critical for reliable parsing and execution. The `z3ed` tool will serialize the user's prompt, current ROM context, available `z3ed` commands, and any relevant `ImGuiTestEngine` capabilities into a JSON object. The AI model will be expected to return a JSON object containing the sequence of commands to be executed, along with potential explanations or confidence scores.
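As a concrete illustration, one exchange under this protocol might look as follows (the field names and example commands are illustrative assumptions; the schema is not yet fixed). Request from `z3ed` to the model:
```json
{
  "prompt": "Apply the palette patch, then verify the Overworld editor opens",
  "rom_context": { "title": "THE LEGEND OF ZELDA", "size_kb": 1024 },
  "available_commands": ["rom patch", "agent test", "agent run"],
  "test_harness_capabilities": ["Click", "Type", "Wait", "Assert"]
}
```
Response from the model:
```json
{
  "commands": [
    { "command": "rom patch --bps palette.bps" },
    { "command": "agent test --prompt \"Open Overworld editor\"" }
  ],
  "explanation": "Apply the patch, then confirm the editor opens via GUI automation.",
  "confidence": 0.85
}
```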