backend-infra-engineer: Post v0.3.9-hotfix7 snapshot (build cleanup)

scawful
2025-12-22 00:20:49 +00:00
parent 2934c82b75
commit 5c4cd57ff8
1259 changed files with 239160 additions and 43801 deletions


@@ -151,7 +151,7 @@ Validates CMake configuration by checking targets, flags, and platform-specific
cmake -P scripts/validate-cmake-config.cmake
# Validate specific build directory
cmake -P scripts/validate-cmake-config.cmake build_ai
cmake -P scripts/validate-cmake-config.cmake build
```
**What it checks:**
@@ -171,7 +171,7 @@ Validates include paths in compile_commands.json to catch missing includes befor
./scripts/check-include-paths.sh
# Check specific build
./scripts/check-include-paths.sh build_ai
./scripts/check-include-paths.sh build
# Verbose mode (show all include dirs)
VERBOSE=1 ./scripts/check-include-paths.sh build
@@ -403,3 +403,164 @@ inline void ProcessData() { /* ... */ }
Full documentation available in:
- [docs/internal/testing/symbol-conflict-detection.md](../docs/internal/testing/symbol-conflict-detection.md)
- [docs/internal/testing/sample-symbol-database.json](../docs/internal/testing/sample-symbol-database.json)
## AI Model Evaluation Suite
Tools for evaluating and comparing AI models used with the z3ed CLI agent system. Located in `scripts/ai/`.
### Quick Start
```bash
# Run a quick smoke test
./scripts/ai/run-model-eval.sh --quick
# Evaluate specific models
./scripts/ai/run-model-eval.sh --models llama3.2,qwen2.5-coder
# Evaluate all available models
./scripts/ai/run-model-eval.sh --all
# Evaluate with comparison report
./scripts/ai/run-model-eval.sh --default --compare
```
### Components
#### run-model-eval.sh
Main entry point. It checks prerequisites, pulls any missing models, and orchestrates the evaluation.
**Options:**
- `--models, -m LIST` - Comma-separated list of models to evaluate
- `--all` - Evaluate all available Ollama models
- `--default` - Evaluate default models from config (llama3.2, qwen2.5-coder, etc.)
- `--tasks, -t LIST` - Task categories: rom_inspection, code_analysis, tool_calling, conversation
- `--timeout SEC` - Timeout per task (default: 120)
- `--quick` - Quick smoke test (single model, fewer tasks)
- `--compare` - Generate comparison report after evaluation
- `--dry-run` - Show what would run without executing
#### eval-runner.py
Python evaluation engine that runs tasks against models and scores responses.
**Features:**
- Multi-model evaluation
- Pattern-based accuracy scoring
- Response completeness analysis
- Tool usage detection
- Response time measurement
- JSON output for analysis
**Direct usage:**
```bash
python scripts/ai/eval-runner.py \
  --models llama3.2,qwen2.5-coder \
  --tasks all \
  --output results/eval-$(date +%Y%m%d).json
```
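The accuracy dimension is the easiest to picture: each task carries `expected_patterns`, and the response is checked against them. A minimal sketch of that idea, assuming regex patterns and a 0-10 scale (illustrative only, not the actual `eval-runner.py` code):
```python
import re

def score_accuracy(response: str, expected_patterns: list[str]) -> float:
    """Score 0-10 by the fraction of expected regex patterns found
    in the model's response (illustrative scoring only)."""
    if not expected_patterns:
        return 0.0
    hits = sum(
        1 for pattern in expected_patterns
        if re.search(pattern, response, re.IGNORECASE)
    )
    return 10.0 * hits / len(expected_patterns)

# Two of three patterns match, so the score is ~6.7/10.
example = "The dungeon list includes Hyrule Castle and Eastern Palace."
print(round(score_accuracy(example, ["dungeon", "Hyrule Castle", "Ganon's Tower"]), 1))
```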
#### compare-models.py
Generates comparison reports from evaluation results.
**Formats:**
- `--format table` - ASCII table (default)
- `--format markdown` - Markdown with analysis
- `--format json` - Machine-readable JSON
**Usage:**
```bash
# Compare all recent evaluations
python scripts/ai/compare-models.py results/eval-*.json
# Generate markdown report
python scripts/ai/compare-models.py --format markdown --output report.md results/*.json
# Get best model name (for scripting)
BEST_MODEL=$(python scripts/ai/compare-models.py --best results/eval-*.json)
```
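For scripting, the `--best` flag can be thought of as ranking models by their stored scores across result files. A rough sketch of that selection, assuming each result file exposes a per-model `overall_score` field (the real schema may differ):
```python
import glob
import json

def best_model(results_glob: str = "results/eval-*.json") -> str:
    """Return the model with the highest mean overall score across
    all matching result files (schema assumed for illustration)."""
    totals: dict[str, list[float]] = {}
    for path in glob.glob(results_glob):
        with open(path) as f:
            data = json.load(f)
        for model, summary in data.get("models", {}).items():
            totals.setdefault(model, []).append(summary["overall_score"])
    if not totals:
        raise SystemExit(f"No results matched {results_glob}")
    return max(totals, key=lambda m: sum(totals[m]) / len(totals[m]))

print(best_model())
```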
#### eval-tasks.yaml
Task definitions and scoring configuration. Categories:
| Category | Description | Example Tasks |
|----------|-------------|---------------|
| rom_inspection | ROM data structure queries | List dungeons, describe maps |
| code_analysis | Code understanding tasks | Explain functions, find bugs |
| tool_calling | Tool usage evaluation | File operations, build commands |
| conversation | Multi-turn dialog | Follow-ups, clarifications |
**Scoring dimensions:**
- **Accuracy** (40%): Pattern matching against expected responses
- **Completeness** (30%): Response depth and structure
- **Tool Usage** (20%): Appropriate tool selection
- **Response Time** (10%): Speed (normalized to 0-10)
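These weights roll up into the single composite score shown in the reports. A small sketch of that weighted sum (the helper is illustrative, not the scorer's actual implementation):
```python
# Dimension weights from the list above.
WEIGHTS = {
    "accuracy": 0.40,
    "completeness": 0.30,
    "tool_usage": 0.20,
    "response_time": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted 0-10 composite from per-dimension 0-10 scores."""
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

# Example: strong accuracy and tool use, average response time.
print(round(composite_score({
    "accuracy": 8.8,
    "completeness": 8.5,
    "tool_usage": 9.2,
    "response_time": 7.0,
}), 2))  # -> 8.61
```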
### Output
Results are saved to `scripts/ai/results/`:
- `eval-YYYYMMDD-HHMMSS.json` - Individual evaluation results
- `comparison-YYYYMMDD-HHMMSS.md` - Comparison reports
**Sample output:**
```
┌───────────────────────────────────────────────────────────┐
│              YAZE AI Model Evaluation Report              │
├───────────────────────────────────────────────────────────┤
│ Model                │ Accuracy │ Tool Use │ Speed │ Runs │
├───────────────────────────────────────────────────────────┤
│ qwen2.5-coder:7b     │   8.8/10 │   9.2/10 │  2.1s │    3 │
│ llama3.2:latest      │   7.9/10 │   7.5/10 │  2.3s │    3 │
│ codellama:7b         │   7.2/10 │   8.1/10 │  2.8s │    3 │
├───────────────────────────────────────────────────────────┤
│ Recommended: qwen2.5-coder:7b (score: 8.7/10)             │
└───────────────────────────────────────────────────────────┘
```
### Prerequisites
- **Ollama**: Install from https://ollama.ai
- **Python 3.10+** with `requests` and `pyyaml`:
  ```bash
  pip install requests pyyaml
  ```
- **At least one model pulled**:
  ```bash
  ollama pull llama3.2
  ```
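A quick way to confirm both the Ollama server and at least one pulled model before launching an evaluation is to query Ollama's local API with the `requests` dependency above. A small sketch, assuming the default endpoint at `http://localhost:11434`:
```python
import requests

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of models the local Ollama server has pulled."""
    resp = requests.get(f"{base_url}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    try:
        models = list_local_models()
    except requests.ConnectionError:
        raise SystemExit("Ollama is not reachable; start it with `ollama serve`.")
    if not models:
        raise SystemExit("No models pulled yet; run `ollama pull llama3.2` first.")
    print("Available models:", ", ".join(models))
```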
### Adding Custom Tasks
Edit `scripts/ai/eval-tasks.yaml` to add new evaluation tasks:
```yaml
categories:
  custom_category:
    description: "My custom tasks"
    tasks:
      - id: "my_task"
        name: "My Task Name"
        prompt: "What is the purpose of..."
        expected_patterns:
          - "expected|keyword|pattern"
        required_tool: null
        scoring:
          accuracy_criteria: "Must mention X, Y, Z"
          completeness_criteria: "Should include examples"
```
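Before running a full evaluation against a new category, it can help to sanity-check that the file parses and that each task carries the fields the runner presumably expects. A minimal validation sketch using `pyyaml` (the required-field set is an assumption drawn from the example above):
```python
import yaml

# Assumed minimum fields, based on the example task above.
REQUIRED_FIELDS = {"id", "name", "prompt", "expected_patterns"}

def validate_tasks(path: str = "scripts/ai/eval-tasks.yaml") -> None:
    """Fail loudly if any task is missing the fields shown in the example."""
    with open(path) as f:
        config = yaml.safe_load(f)
    for category, spec in config.get("categories", {}).items():
        for task in spec.get("tasks", []):
            missing = REQUIRED_FIELDS - task.keys()
            if missing:
                raise ValueError(f"{category}/{task.get('id', '?')}: missing {sorted(missing)}")
    print("eval-tasks.yaml looks structurally valid.")

if __name__ == "__main__":
    validate_tasks()
```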
### Integration with CI
The evaluation suite can be integrated into CI pipelines:
```yaml
# .github/workflows/ai-eval.yml
- name: Run AI Evaluation
  run: |
    ollama serve &
    sleep 5
    ollama pull llama3.2
    ./scripts/ai/run-model-eval.sh --models llama3.2 --tasks tool_calling
```