doc: Plan test harness with introspection capabilities (IT-05)

2025-10-02 15:00:34 -04:00
parent fdead0e9e5
commit 3a573c0764
5 changed files with 1552 additions and 118 deletions
--- a/docs/z3ed/E6-z3ed-implementation-plan.md
+++ b/docs/z3ed/E6-z3ed-implementation-plan.md
@@ -1,9 +1,28 @@
-# z3ed Agentic Wo**Active Phase**:
- **Policy Evaluation Framework (AW-04)**: YAML-based constraint system for gating proposal acceptance - implementation complete, ready for production testing.
+# z3ed Agentic Workflow Plan
+
+**Last Updated**: October 2, 2025
+**Status**: Core Infrastructure Complete | Test Harness Enhancement Phase 🎯
+
+> 📋 **Quick Start**: See [README.md](README.md) for essential links and project status.
+
+## Executive Summary
+
+The z3ed CLI and AI agent workflow system has completed major infrastructure milestones:
+
+**✅ Completed Phases**:
+- **Phase 6**: Resource Catalogue - Machine-readable API specs for AI consumption
+- **AW-01/02/03**: Acceptance Workflow - Proposal tracking, sandbox management, GUI review with ROM merging
+- **AW-04**: Policy Evaluation Framework - YAML-based constraint system for proposal acceptance
+- **IT-01**: ImGuiTestHarness - Full GUI automation via gRPC + ImGuiTestEngine (all 3 phases complete)
+- **IT-02**: CLI Agent Test - Natural language → automated GUI testing (implementation complete)
+
+**🔄 Active Phase**:
+- **Test Harness Enhancements (IT-05 to IT-09)**: Expanding from basic automation to comprehensive testing platform

 **📋 Next Phases**:
- **Priority 1**: Production Testing - Validate policy enforcement with real ROM modification proposals.
- **Priority 2**: Windows Cross-Platform Testing - Ensure z3ed works on Windows targets with gRPC integration.
+- **Priority 1**: Test Introspection API (IT-05) - Enable test status querying and result polling
+- **Priority 2**: Widget Discovery API (IT-06) - AI agents enumerate available GUI interactions
+- **Priority 3**: Test Recording & Replay (IT-07) - Capture workflows for regression testing

 **Recent Accomplishments** (Updated: January 2025):
 - **✅ Policy Framework Complete**: PolicyEvaluator service fully integrated with ProposalDrawer GUI
@@ -20,49 +39,17 @@
 - **Build System**: Hardened CMake configuration with reliable gRPC integration
 - **Proposal Workflow**: Agentic proposal system fully operational (create, list, diff, review in GUI)

-**Known Limitations** (Non-Blocking):
- **Screenshot RPC**: Stub implementation (returns "not implemented" - planned for production phase)
- **Widget Naming**: Documentation needed for icon prefixes and naming conventions
+**Known Limitations & Improvement Opportunities**:
+- **Screenshot RPC**: Stub implementation → needs SDL_Surface capture + PNG encoding
+- **Test Introspection**: No way to query test status, results, or queue → add GetTestStatus/ListTests RPCs
+- **Widget Discovery**: AI agents can't enumerate available widgets → add DiscoverWidgets RPC
+- **Test Recording**: No record/replay for regression testing → add StartRecording/StopRecording and ReplayTest RPCs
+- **Synchronous Wait**: Async tests return immediately → add blocking mode or result polling
+- **Error Context**: Test failures lack screenshots/state dumps → enhance error reporting
 - **Performance**: Tests add ~166ms per Wait call due to frame yielding (acceptable trade-off)
 - **YAML Parsing**: Simple parser implemented, consider yaml-cpp for complex scenarios

-**Time Investment**: 28.5 hours total (IT-01: 11h, IT-02: 7.5h, E2E: 2h, Policy: 6h, Docs: 2h)on Plan
-
-**Last Updated**: [Current Date]
-**Status**: Core Infrastructure Complete | E2E Validation In Progress 🎯
-
-> 📋 **Quick Start**: See [README.md](README.md) for essential links and project status.
-
-## Executive Summary
-
-The z3ed CLI and AI agent workflow system has completed major infrastructure milestones:
-
-**✅ Completed Phases**:
- **Phase 6**: Resource Catalogue - Machine-readable API specs for AI consumption
- **AW-01/02/03**: Acceptance Workflow - Proposal tracking, sandbox management, GUI review with ROM merging
- **IT-01**: ImGuiTestHarness - Full GUI automation via gRPC + ImGuiTestEngine (all 3 phases complete)
- **IT-02**: CLI Agent Test - Natural language → automated GUI testing (implementation complete)
-
-**🔄 Active Phase**:
- **E2E Validation**: Testing complete proposal lifecycle with real GUI widgets (window detection debugging in progress)
-
-**📋 Next Phases**:
- **Priority 1**: Complete E2E Validation - Fix window detection after menu actions (2-3 hours)
- **Priority 2**: Policy Evaluation Framework (AW-04) - YAML-based constraints for proposal acceptance (6-8 hours)
-
-**Recent Accomplishments** (October 2, 2025):
- IT-02 implementation complete with async test queue pattern
- Build system fixes for z3ed target (gRPC integration)
- Documentation consolidated into clean structure
- E2E test script operational (5/6 RPCs working)
- Menu interaction verified via ImGuiTestEngine
-
-**Known Issues**:
- Window detection timing after menu clicks needs refinement
- Screenshot RPC proto mismatch (non-critical)
-
-**Time Investment**: 20.5 hours total (IT-01: 11h, IT-02: 7.5h, Docs: 2h)  
-**Code Quality**: All targets compile cleanly, no crashes, partial test coverage
+**Time Investment**: 28.5 hours total (IT-01: 11h, IT-02: 7.5h, E2E: 2h, Policy: 6h, Docs: 2h)

 ## Quick Reference

@@ -94,83 +81,326 @@ The z3ed CLI and AI agent workflow system has completed major infrastructure mil

 ## 1. Current Priorities (Week of Oct 2-8, 2025)

-**Status**: IT-01 Complete ✅ | IT-02 Complete ✅ | E2E Tests Running ⚡
+**Status**: Core Infrastructure Complete ✅ | Test Harness Enhancement Phase 🔧

-### Priority 0: E2E Test Validation (IMMEDIATE) 🎯
-**Goal**: Validate test harness with real YAZE widgets  
-**Time Estimate**: 30-60 minutes  
-**Status**: Test script running, needs real widget names
+### Priority 1: Test Harness Enhancements (IT-05 to IT-09) 🔧 ACTIVE
+**Goal**: Transform test harness from basic automation to comprehensive testing platform  
+**Time Estimate**: 20-25 hours total  
+**Blocking Dependency**: IT-01 Complete ✅

-**Current Results**: 
- ✅ Ping RPC working
- ⚠️ Tests 2-5 using fake widget names
- 📋 Need to identify real widget names from YAZE source
- 🔧 Screenshot RPC needs proto fix
-
-**Task Checklist**:
-1. ✅ **E2E Test Script**: Already created (`scripts/test_harness_e2e.sh`)
-2. 📋 **Manual Testing Workflow**:
-   - Start YAZE with test harness enabled
-   - Create proposal via CLI: `z3ed agent run "Test prompt" --sandbox`
-   - Verify proposal appears in ProposalDrawer GUI
-   - Test Accept → validate ROM merge and save prompt
-   - Test Reject → validate status update
-   - Test Delete → validate cleanup
-3. 📋 **Real Widget Testing**:
-   - Click actual YAZE buttons (Overworld, Dungeon, etc.)
-   - Type into real input fields
-   - Wait for actual windows to appear
-   - Assert on real widget states
-4. 📋 **Document Edge Cases**:
-   - Widget not found scenarios
-   - Timeout handling
-   - Error recovery patterns
-
-### Priority 2: CLI Agent Test Command (IT-02) 📋 NEXT
-**Goal**: Natural language → automated GUI testing via gRPC  
-**Time Estimate**: 4-6 hours  
-**Blocking Dependency**: Priority 1 completion
+**Motivation**: Current test harness supports basic GUI automation but lacks features for:
+- **AI Agent Development**: No widget discovery API for LLMs to learn available interactions
+- **Regression Testing**: No recording/replay mechanism for test suite management
+- **CI/CD Integration**: No standardized test format for automated pipelines
+- **Debugging**: Limited error context when tests fail (no screenshots, state dumps)
+- **Test Management**: Can't query test status, results, or execution queue

+#### IT-05: Test Introspection API (6-8 hours)
 **Implementation Tasks**:
-1. **Create `z3ed agent test` command**:
-   - Parse natural language prompt
-   - Generate RPC call sequence (Click → Wait → Assert)
-   - Execute via gRPC client
-   - Capture results and screenshots
+1. **Add GetTestStatus RPC**:
+   - Query status of queued/running tests by ID
+   - Return test state: queued, running, passed, failed, timeout
+   - Include execution time, error messages, assertion failures
   
-2. **Example Usage**:
-   ```bash
-   z3ed agent test --prompt "Open Overworld editor and verify it loads" \
-     --rom zelda3.sfc
+2. **Add ListTests RPC**:
+   - Enumerate all registered tests in ImGuiTestEngine
+   - Filter by category (grpc, unit, integration, e2e)
+   - Return test metadata: name, category, last run time, pass/fail count
   
-   # Generated workflow:
-   # 1. Click "button:Overworld"
-   # 2. Wait "window_visible:Overworld Editor" (5s)
-   # 3. Assert "visible:Overworld Editor"
-   # 4. Screenshot "full"
-   ```
+3. **Add GetTestResults RPC**:
+   - Retrieve detailed results for completed tests
+   - Include assertion logs, performance metrics, resource usage
+   - Support pagination for large result sets

-3. **Implementation Files**:
-   - `src/cli/handlers/agent.cc` - Add `HandleTestCommand()`
-   - `src/cli/service/gui_automation_client.{h,cc}` - gRPC client wrapper
-   - `src/cli/service/test_workflow_generator.{h,cc}` - Prompt → RPC translator
+**Example Usage**:
+```bash
+# Queue a test
+z3ed agent test --prompt "Open Overworld editor"

-### Priority 3: Policy Evaluation Framework (AW-04) 📋
-**Goal**: YAML-based constraint system for gating proposal acceptance  
-**Time Estimate**: 6-8 hours  
-**Blocking Dependency**: None (can work in parallel)
+# Poll for completion
+z3ed test status --test-id grpc_click_12345678

-> 📋 **Detailed Guides**: See [NEXT_PRIORITIES_OCT2.md](NEXT_PRIORITIES_OCT2.md) for complete implementation breakdowns with code examples.
+# Retrieve results
+z3ed test results --test-id grpc_click_12345678 --format json
+```

---
+**API Schema**:
+```proto
+message GetTestStatusRequest {
+  string test_id = 1;
+}

-## 2. Workstreams Overview
+message GetTestStatusResponse {
+  enum Status { QUEUED = 0; RUNNING = 1; PASSED = 2; FAILED = 3; TIMEOUT = 4; }
+  Status status = 1;
+  int64 execution_time_ms = 2;
+  string error_message = 3;
+  repeated string assertion_failures = 4;
+}
+
+message ListTestsRequest {
+  string category_filter = 1;  // Optional: "grpc", "unit", etc.
+  int32 page_size = 2;
+  string page_token = 3;
+}
+
+message ListTestsResponse {
+  repeated TestInfo tests = 1;
+  string next_page_token = 2;
+}
+
+message TestInfo {
+  string test_id = 1;
+  string name = 2;
+  string category = 3;
+  int64 last_run_timestamp_ms = 4;
+  int32 total_runs = 5;
+  int32 pass_count = 6;
+  int32 fail_count = 7;
+}
+```
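The result-polling flow implied by `GetTestStatus` can be sketched on the client side as follows. This is a minimal illustration, not the planned implementation: `wait_for_test`, its parameters, and the injected `get_status` callable (standing in for the eventual gRPC call) are all hypothetical names. Only the status values come from the schema above.

```python
import time
from typing import Callable

# Terminal states mirror the Status enum in GetTestStatusResponse.
TERMINAL_STATES = {"PASSED", "FAILED", "TIMEOUT"}


def wait_for_test(
    get_status: Callable[[str], str],  # hypothetical stand-in for the GetTestStatus RPC
    test_id: str,
    poll_interval_s: float = 0.5,
    deadline_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the test reaches a terminal state or the client deadline passes."""
    waited = 0.0
    while True:
        status = get_status(test_id)
        if status in TERMINAL_STATES:
            return status
        if waited >= deadline_s:
            # Client-side deadline; distinct from the harness reporting TIMEOUT.
            raise TimeoutError(f"{test_id} still {status} after {deadline_s}s")
        sleep(poll_interval_s)
        waited += poll_interval_s
```

Injecting `get_status` and `sleep` keeps the loop testable without a running harness; a real client would presumably wrap the generated gRPC stub instead.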
+
+#### IT-06: Widget Discovery API (4-6 hours)
+**Implementation Tasks**:
+1. **Add DiscoverWidgets RPC**:
+   - Enumerate all windows currently open in YAZE GUI
+   - List all interactive widgets (buttons, inputs, menus, tabs) per window
+   - Return widget metadata: ID, type, label, enabled state, position
+   - Support filtering by window name or widget type
+   
+2. **AI-Friendly Output Format**:
+   - JSON schema describing available interactions
+   - Natural language descriptions for each widget
+   - Suggested action templates (e.g., "Click button:{label}")
+
+**Example Usage**:
+```bash
+# Discover all widgets
+z3ed gui discover
+
+# Filter by window
+z3ed gui discover --window "Overworld"
+
+# Get only buttons
+z3ed gui discover --type button
+```
+
+**API Schema**:
+```proto
+message DiscoverWidgetsRequest {
+  string window_filter = 1;  // Optional: filter by window name
+  enum WidgetType { ALL = 0; BUTTON = 1; INPUT = 2; MENU = 3; TAB = 4; CHECKBOX = 5; }
+  WidgetType type_filter = 2;
+}
+
+message DiscoverWidgetsResponse {
+  repeated WindowInfo windows = 1;
+}
+
+message WindowInfo {
+  string name = 1;
+  bool is_visible = 2;
+  repeated WidgetInfo widgets = 3;
+}
+
+message WidgetInfo {
+  string id = 1;
+  string label = 2;
+  string type = 3;  // "button", "input", "menu", etc.
+  bool is_enabled = 4;
+  string position = 5;  // "x,y,width,height"
+  string suggested_action = 6;  // "Click button:Open ROM"
+}
+```
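As an illustration of how an agent might consume a `DiscoverWidgetsResponse`, the sketch below decodes the `position` string and builds action strings in the `Click button:{label}` template. The `Widget` dataclass and helper names are hypothetical; it assumes `position` holds four comma-separated integers, as the schema comment suggests.

```python
from dataclasses import dataclass


@dataclass
class Widget:
    """Hypothetical client-side mirror of the WidgetInfo message."""
    id: str
    label: str
    type: str          # "button", "input", "menu", ...
    is_enabled: bool
    position: str      # "x,y,width,height"


def parse_position(position: str) -> tuple[int, int, int, int]:
    """Split "x,y,width,height" into integers (assumes exactly four fields)."""
    x, y, width, height = (int(part) for part in position.split(","))
    return x, y, width, height


def center(position: str) -> tuple[int, int]:
    """Return the midpoint of the widget's bounding box."""
    x, y, width, height = parse_position(position)
    return x + width // 2, y + height // 2


def clickable_actions(widgets: list[Widget]) -> list[str]:
    """Build "Click button:{label}" strings for enabled buttons only."""
    return [
        f"Click {w.type}:{w.label}"
        for w in widgets
        if w.is_enabled and w.type == "button"
    ]
```

Filtering out disabled widgets on the client is a design choice made here for illustration; the plan does not say whether the server would do this.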
+
+**Benefits for AI Agents**:
+- LLMs can dynamically learn available GUI interactions
+- Agents can adapt to UI changes without hardcoded widget names
+- Natural language descriptions enable better prompt engineering
+
+#### IT-07: Test Recording & Replay (8-10 hours)
+**Implementation Tasks**:
+1. **Add StartRecording/StopRecording RPCs**:
+   - Capture all RPC calls during a session
+   - Record timing, parameters, and results
+   - Save to JSON test script format
+   
+2. **Add ReplayTest RPC**:
+   - Load JSON test script
+   - Execute recorded actions sequentially
+   - Validate expected results match actual results
+   - Support parameterization (e.g., replace ROM filename)
+   
+3. **Test Script Format**:
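+   - Human-readable JSON; standard JSON has no comment syntax, so annotations need a dedicated field (e.g. `description`) or a JSONC-style parser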
+   - Support assertions and conditionals
+   - Enable test suite composition (call other scripts)
+
+**Example Workflow**:
+```bash
+# Start recording
+z3ed test record start --output overworld_test.json
+
+# Perform actions (manually or via agent)
+z3ed agent test --prompt "Open Overworld editor"
+z3ed agent test --prompt "Click tile at 10,20"
+
+# Stop recording
+z3ed test record stop
+
+# Replay test
+z3ed test replay overworld_test.json
+
+# Run in CI
+z3ed test replay tests/*.json --ci-mode
+```
+
+**JSON Test Script Example**:
+```json
+{
+  "name": "Overworld Editor Load Test",
+  "description": "Verify Overworld editor opens and tile selection works",
+  "steps": [
+    {
+      "action": "Click",
+      "target": "menuitem: Overworld Editor",
+      "expected_result": { "success": true }
+    },
+    {
+      "action": "Wait",
+      "condition": "window_visible:Overworld",
+      "timeout_ms": 5000
+    },
+    {
+      "action": "Assert",
+      "condition": "visible:Overworld",
+      "expected": { "success": true, "actual_value": "visible" }
+    }
+  ]
+}
+```
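A replay loop over a script of this shape could look roughly like the sketch below. It is illustrative only: `replay` and the `handlers` mapping (action name to a callable that would issue the matching RPC) are hypothetical. It accepts both `expected_result` and `expected`, since the example script uses both keys.

```python
import json
from typing import Callable


def replay(script_text: str, handlers: dict[str, Callable[[dict], dict]]) -> list[str]:
    """Run recorded steps in order and return failure messages (empty if all pass)."""
    script = json.loads(script_text)
    failures: list[str] = []
    for index, step in enumerate(script["steps"], start=1):
        # Each handler is a stand-in for the corresponding harness RPC.
        result = handlers[step["action"]](step)
        # Steps without an expectation (e.g. Wait) are executed but not checked.
        expected = step.get("expected_result") or step.get("expected") or {}
        for key, want in expected.items():
            got = result.get(key)
            if got != want:
                failures.append(
                    f"step {index} ({step['action']}): {key} was {got!r}, expected {want!r}"
                )
        if failures:
            break  # later steps usually depend on earlier ones, so stop here
    return failures
```

Stopping at the first failing step is an assumption made for this sketch; a real runner might instead continue and collect every mismatch.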
+
+#### IT-08: Enhanced Error Reporting (3-4 hours)
+**Implementation Tasks**:
+1. **Screenshot on Failure**:
+   - Implement Screenshot RPC (complete stub)
+   - Automatically capture screenshot when test fails
+   - Save to proposal directory or test results folder
+   
+2. **Widget State Dumps**:
+   - Capture full widget tree on assertion failure
+   - Include widget properties (enabled, visible, position, text)
+   - Generate HTML report with annotated screenshots
+   
+3. **Execution Context**:
+   - Log ImGui state: active window, focused widget, frame count
+   - Capture recent ImGui events (clicks, key presses, hovers)
+   - Include resource stats: memory, textures, framerate
+
+**Error Report Example**:
+```json
+{
+  "test_id": "grpc_assert_12345678",
+  "failure_time": "2025-10-02T14:23:45Z",
+  "assertion": "visible:Overworld",
+  "expected": "visible",
+  "actual": "hidden",
+  "screenshot": "/tmp/yaze_test_12345678.png",
+  "widget_state": {
+    "active_window": "Main Window",
+    "focused_widget": null,
+    "visible_windows": ["Main Window", "Debug"],
+    "overworld_window": { "exists": true, "visible": false, "position": "0,0,0,0" }
+  },
+  "execution_context": {
+    "frame_count": 1234,
+    "recent_events": ["Click: menuitem: Overworld Editor", "Wait: window_visible:Overworld"],
+    "resource_stats": { "memory_mb": 245, "textures": 12, "framerate": 60.0 }
+  }
+}
+```
+
+#### IT-09: CI/CD Integration (2-3 hours)
+**Implementation Tasks**:
+1. **Standardized Test Suite Format**:
+   - YAML/JSON format for test suite definitions
+   - Support test groups (smoke, regression, nightly)
+   - Enable parallel execution with dependencies
+   
+2. **CI-Friendly CLI**:
+   - `z3ed test run-suite tests/suite.yaml --ci-mode`
+   - Exit codes: 0 = all passed, 1 = failures, 2 = errors
+   - JUnit XML output for CI parsers
+   - GitHub Actions integration examples
+   
+3. **Documentation**:
+   - Add `.github/workflows/gui-tests.yml` example
+   - Create sample test suites for common scenarios
+   - Document best practices for flaky test handling
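The exit-code convention and JUnit output listed above might be produced along these lines. This is a sketch under assumptions: the `(name, outcome, message)` tuple shape and both function names are hypothetical, errors are assumed to take precedence over failures, and the XML uses only the widely recognised `testsuite`/`testcase`/`failure`/`error` elements rather than any specific schema.

```python
import xml.etree.ElementTree as ET

# Each result is (test name, outcome, message); outcome is "passed", "failed", or "error".
Result = tuple[str, str, str]


def exit_code(results: list[Result]) -> int:
    """Map outcomes to 0 = all passed, 1 = failures, 2 = errors (errors win)."""
    outcomes = {outcome for _, outcome, _ in results}
    if "error" in outcomes:
        return 2
    if "failed" in outcomes:
        return 1
    return 0


def junit_xml(suite_name: str, results: list[Result]) -> str:
    """Render results as a single JUnit-style <testsuite> element."""
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(sum(1 for _, o, _ in results if o == "failed")),
        errors=str(sum(1 for _, o, _ in results if o == "error")),
    )
    for name, outcome, message in results:
        case = ET.SubElement(suite, "testcase", name=name)
        if outcome == "failed":
            ET.SubElement(case, "failure", message=message)
        elif outcome == "error":
            ET.SubElement(case, "error", message=message)
    return ET.tostring(suite, encoding="unicode")
```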
+
+**Test Suite Format**:
+```yaml
+name: YAZE GUI Test Suite
+description: Comprehensive tests for YAZE editor functionality
+version: 1.0
+
+config:
+  timeout_per_test: 30s
+  retry_on_failure: 2
+  parallel_execution: false
+
+test_groups:
+  - name: smoke
+    description: Fast tests for basic functionality
+    tests:
+      - tests/overworld_load.json
+      - tests/dungeon_load.json
+  
+  - name: regression
+    description: Full test suite for release validation
+    depends_on: [smoke]
+    tests:
+      - tests/palette_edit.json
+      - tests/sprite_load.json
+      - tests/rom_save.json
+```
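One way a runner could honor `depends_on` is a topological sort over the groups, sketched below with the standard-library `graphlib` module (Python 3.9+). The function name is hypothetical, and the input is assumed to be the `test_groups` list after YAML parsing into plain dicts.

```python
from graphlib import TopologicalSorter


def group_order(test_groups: list[dict]) -> list[str]:
    """Return group names so that every group runs after the groups it depends on."""
    names = {group["name"] for group in test_groups}
    graph: dict[str, set[str]] = {}
    for group in test_groups:
        deps = set(group.get("depends_on", []))
        unknown = deps - names
        if unknown:
            raise ValueError(f"{group['name']} depends on undefined groups: {sorted(unknown)}")
        graph[group["name"]] = deps
    # static_order() yields dependencies first and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())
```

For the suite above this yields `smoke` before `regression`; whether a failing `smoke` group should skip `regression` is left open by the plan.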
+
+**GitHub Actions Integration**:
+```yaml
+name: GUI Tests
+on: [push, pull_request]
+
+jobs:
+  gui-tests:
+    runs-on: macos-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Build YAZE with test harness
+        run: |
+          cmake -B build -DYAZE_WITH_GRPC=ON
+          cmake --build build --target yaze --target z3ed
+      - name: Start test harness
+        run: |
+          ./build/bin/yaze --enable_test_harness --headless &
+          sleep 5
+      - name: Run test suite
+        run: |
+          ./build/bin/z3ed test run-suite tests/suite.yaml --ci-mode
+      - name: Upload test results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: test-results
+          path: test-results/
+```
+
+### Priority 2: Windows Cross-Platform Testing 🪟
+**Goal**: Validate z3ed and test harness on Windows  
+**Time Estimate**: 8-10 hours  
+**Blocking Dependency**: IT-05 Complete (need stable API)
+
+> 📋 **Detailed Guides**: See [NEXT_PRIORITIES_OCT2.md](NEXT_PRIORITIES_OCT2.md) for complete implementation breakdowns with code examples.

-This plan decomposes the design additions into actionable engineering tasks. Each workstream contains milestones, blocking dependencies, and expected deliverables.
-1. `src/cli/handlers/rom.cc` - Added `RomInfo::Run` implementation
-2. `src/cli/z3ed.h` - Added `RomInfo` class declaration  
-3. `src/cli/modern_cli.cc` - Updated `HandleRomInfoCommand` routing
-4. `src/cli/service/resource_catalog.cc` - Added `rom info` schema entry
 ---

 ## 2. Workstreams Overview
@@ -225,6 +455,11 @@ This plan decomposes the design additions into actionable engineering tasks. Eac
 | IT-02 | Implement CLI agent step translation (`imgui_action` → harness call). | ImGuiTest Bridge | Code | ✅ Done | `z3ed agent test` command with natural language prompts (7.5 hours) |
 | IT-03 | Provide synchronization primitives (`WaitForIdle`, etc.). | ImGuiTest Bridge | Code | ✅ Done | Wait RPC with condition polling already implemented in IT-01 Phase 3 |
 | IT-04 | Complete E2E validation with real YAZE widgets | ImGuiTest Bridge | Test | ✅ Done | IT-02 - All 5 functional tests passing, window detection fixed with yield buffer |
+| IT-05 | Add test introspection RPCs (GetTestStatus, ListTests, GetTestResults) | ImGuiTest Bridge | Code | 📋 Planned | IT-01 - Enable clients to poll test results and query execution state |
+| IT-06 | Implement widget discovery API for AI agents | ImGuiTest Bridge | Code | 📋 Planned | IT-01 - DiscoverWidgets RPC to enumerate windows, buttons, inputs |
+| IT-07 | Add test recording/replay for regression testing | ImGuiTest Bridge | Code | 📋 Planned | IT-05 - StartRecording/StopRecording and ReplayTest RPCs with JSON test scripts |
+| IT-08 | Enhance error reporting with screenshots and state dumps | ImGuiTest Bridge | Code | 📋 Planned | IT-01 - Capture widget state on failure for debugging |
+| IT-09 | Create standardized test suite format for CI integration | ImGuiTest Bridge | Infra | 📋 Planned | IT-07 - JSON/YAML test suite format compatible with CI/CD pipelines |
 | VP-01 | Expand CLI unit tests for new commands and sandbox flow. | Verification Pipeline | Test | 📋 Planned | RC/AW tasks |
 | VP-02 | Add harness integration tests with replay scripts. | Verification Pipeline | Test | 📋 Planned | IT tasks |
 | VP-03 | Create CI job running agent smoke tests with `YAZE_WITH_JSON`. | Verification Pipeline | Infra | 📋 Planned | VP-01, VP-02 |
@@ -234,10 +469,10 @@ This plan decomposes the design additions into actionable engineering tasks. Eac
 _Status Legend: 🔄 Active · 📋 Planned · ✅ Done_

 **Progress Summary**:
- ✅ Completed: 11 tasks (61%)
- 🔄 Active: 1 task (6%)
- 📋 Planned: 6 tasks (33%)
- **Total**: 18 tasks
+- ✅ Completed: 11 tasks (48%)
+- 🔄 Active: 1 task (4%)
+- 📋 Planned: 11 tasks (48%)
+- **Total**: 23 tasks (5 new test harness enhancements added)

 ## 3. Immediate Next Steps (Week of Oct 1-7, 2025)