# LLM Integration Implementation Checklist

**Created**: October 3, 2025
**Status**: Ready to Begin
**Estimated Time**: 12-15 hours total

> 📋 **Main Guide**: See `LLM-INTEGRATION-PLAN.md` for detailed implementation instructions.
## Phase 1: Ollama Local Integration (4-6 hours) ✅ COMPLETE

### Prerequisites

- Install Ollama: `brew install ollama` (macOS)
- Start Ollama server: `ollama serve`
- Pull recommended model: `ollama pull qwen2.5-coder:7b`
- Test connectivity: `curl http://localhost:11434/api/tags`
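The same sequence, in order, for copy-paste (macOS with Homebrew):

```bash
brew install ollama                   # macOS install
ollama serve                          # start the local server
ollama pull qwen2.5-coder:7b          # fetch the recommended model
curl http://localhost:11434/api/tags  # verify the server responds
```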
### Implementation Tasks

#### 1.1 Create OllamaAIService Class

- Create `src/cli/service/ollama_ai_service.h`
  - Define `OllamaConfig` struct
  - Declare `OllamaAIService` class with `GetCommands()` override
  - Add `CheckAvailability()` and `ListAvailableModels()` methods
- Create `src/cli/service/ollama_ai_service.cc`
  - Implement constructor with config
  - Implement `BuildSystemPrompt()` with z3ed command documentation
  - Implement `CheckAvailability()` with health check
  - Implement `GetCommands()` with Ollama API call
  - Add JSON parsing for command extraction
  - Add error handling for connection failures
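A minimal sketch of what the header could look like. The `AIService` base interface is an assumption (implied by the existing MockAIService/GeminiAIService), and member defaults are illustrative:

```cpp
// src/cli/service/ollama_ai_service.h -- sketch, not the final header.
#pragma once

#include <string>
#include <utility>
#include <vector>

// Assumed shared base interface; the real declaration may differ.
class AIService {
 public:
  virtual ~AIService() = default;
  virtual std::vector<std::string> GetCommands(const std::string& prompt) = 0;
};

struct OllamaConfig {
  std::string base_url = "http://localhost:11434";  // default Ollama endpoint
  std::string model = "qwen2.5-coder:7b";
  double temperature = 0.2;  // illustrative default
};

class OllamaAIService : public AIService {
 public:
  explicit OllamaAIService(OllamaConfig config) : config_(std::move(config)) {}

  // Generates z3ed commands from a natural-language prompt via the Ollama API.
  std::vector<std::string> GetCommands(const std::string& prompt) override;

  // Health check: true if the server answers GET /api/tags.
  bool CheckAvailability();

  // Models the local server has already pulled.
  std::vector<std::string> ListAvailableModels();

 private:
  // System prompt carrying the z3ed command documentation.
  std::string BuildSystemPrompt() const;

  OllamaConfig config_;
};
```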
#### 1.2 Update CMake Configuration

- Add `YAZE_WITH_HTTPLIB` option to `CMakeLists.txt`
- Add httplib detection (vcpkg or bundled)
- Add compile definition `YAZE_WITH_HTTPLIB`
- Update z3ed target to link httplib when available
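A sketch of the intended wiring; the bundled path is an assumption, and cpp-httplib's vcpkg package exports the `httplib::httplib` target:

```cmake
option(YAZE_WITH_HTTPLIB "Build z3ed with HTTP support for LLM providers" ON)

if(YAZE_WITH_HTTPLIB)
  find_package(httplib CONFIG QUIET)          # vcpkg-provided config package
  if(NOT httplib_FOUND)
    add_subdirectory(third_party/cpp-httplib) # bundled fallback; path assumed
  endif()
  target_compile_definitions(z3ed PRIVATE YAZE_WITH_HTTPLIB)
  target_link_libraries(z3ed PRIVATE httplib::httplib)
endif()
```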
#### 1.3 Wire into Agent Commands

- Update `src/cli/handlers/agent/general_commands.cc`
  - Add `#include "cli/service/ollama_ai_service.h"`
  - Create `CreateAIService()` helper function
  - Implement provider selection logic (env vars)
  - Add health check with fallback to MockAIService
- Update `HandleRunCommand()` to use service factory
- Update `HandlePlanCommand()` to use service factory
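A sketch of the provider-selection logic described above, assuming the service classes from Phases 1-3; `GEMINI_API_KEY` and the exact fallback order are assumptions, only the other names come from this checklist:

```cpp
// general_commands.cc -- sketch of the service factory.
#include <cstdlib>
#include <memory>
#include <string>

std::unique_ptr<AIService> CreateAIService() {
  const char* provider = std::getenv("YAZE_AI_PROVIDER");
  const std::string name = provider ? provider : "mock";

  if (name == "ollama") {
    OllamaConfig config;
    if (const char* model = std::getenv("OLLAMA_MODEL")) config.model = model;
    auto service = std::make_unique<OllamaAIService>(config);
    if (service->CheckAvailability()) return service;
    // Server unreachable: fall through to the mock service below.
  } else if (name == "gemini") {
    // GEMINI_API_KEY is an assumed variable name.
    if (const char* key = std::getenv("GEMINI_API_KEY")) {
      return std::make_unique<GeminiAIService>(key);
    }
  }
  return std::make_unique<MockAIService>();  // safe default for CI and tests
}
```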
#### 1.4 Testing & Validation

- Create `scripts/test_ollama_integration.sh`
  - Check Ollama server availability
  - Verify model is pulled
  - Test `z3ed agent run` with simple prompt
  - Verify proposal creation
  - Review generated commands
- Run end-to-end test
- Document any issues encountered
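A sketch of what the script might check; the `z3ed agent run` invocation comes from the success criteria below, the rest is illustrative:

```bash
#!/usr/bin/env bash
# scripts/test_ollama_integration.sh -- sketch.
set -euo pipefail

# 1. Server availability.
curl -sf http://localhost:11434/api/tags > /dev/null \
  || { echo "Ollama server not running; try: ollama serve"; exit 1; }

# 2. Model is pulled.
ollama list | grep -q "qwen2.5-coder:7b" \
  || { echo "Model missing; try: ollama pull qwen2.5-coder:7b"; exit 1; }

# 3. End-to-end smoke test: prompt -> generated commands -> proposal.
z3ed agent run --prompt "Validate ROM"
```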
### Success Criteria

- `z3ed agent run --prompt "Validate ROM"` generates the correct command
- Health check reports clear errors when Ollama is unavailable
- Service fallback to MockAIService works correctly
- Test script passes without manual intervention

**Status**: ✅ Complete - see `PHASE1-COMPLETE.md`
## Phase 2: Improve Gemini Integration (2-3 hours) ✅ COMPLETE

### Implementation Tasks

#### 2.1 Fix GeminiAIService

- Update `src/cli/service/gemini_ai_service.h`
  - Add `GeminiConfig` struct with model, temperature, max_tokens
  - Add health check methods
  - Update constructor signature
- Update `src/cli/service/gemini_ai_service.cc`
  - Fix system instruction format (separate field in the v1beta API)
  - Update to use the `gemini-2.5-flash` model
  - Add generation config (temperature, maxOutputTokens)
  - Add `responseMimeType: application/json` for structured output
  - Implement markdown code block stripping
  - Add `CheckAvailability()` with API key validation
  - Improve error messages with actionable guidance
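For reference, the v1beta request shape these items describe, posted to `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`; the system instruction is its own top-level field, and JSON output is requested via `generationConfig` (values illustrative):

```json
{
  "system_instruction": { "parts": [{ "text": "<z3ed command documentation>" }] },
  "contents": [{ "role": "user", "parts": [{ "text": "Make all soldiers red" }] }],
  "generationConfig": {
    "temperature": 0.2,
    "maxOutputTokens": 1024,
    "responseMimeType": "application/json"
  }
}
```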
#### 2.2 Wire into Service Factory

- Update `CreateAIService()` to use `GeminiConfig`
- Add Gemini health check with fallback
- Add `GEMINI_MODEL` environment variable support
- Test with graceful fallback
#### 2.3 Testing

- Create `scripts/test_gemini_integration.sh`
- Test graceful fallback without API key
- Test error handling (invalid key, network issues)
- Test with real API key (pending)
- Verify JSON array parsing (pending)
- Test various prompts (pending)
### Success Criteria

- Gemini service compiles and builds
- Service factory integration works
- Graceful fallback to MockAIService
- Gemini generates valid command arrays (pending API key)
- Markdown stripping works reliably (pending API key)
- Error messages guide the user to API key setup

**Status**: ✅ Complete (build & integration) - see `PHASE2-COMPLETE.md`
**Pending**: Real API key validation
## Phase 3: Add Claude Integration (2-3 hours)

### Implementation Tasks

#### 3.1 Create ClaudeAIService

- Create `src/cli/service/claude_ai_service.h`
  - Define class with API key constructor
  - Add `GetCommands()` override
- Create `src/cli/service/claude_ai_service.cc`
  - Implement Claude Messages API call
  - Use the `claude-3-5-sonnet-20241022` model
  - Add markdown stripping
  - Add error handling
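For reference, a Messages API request is a POST to `https://api.anthropic.com/v1/messages` with `x-api-key` and `anthropic-version` headers and a body of this shape (values illustrative):

```json
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "system": "<z3ed command documentation>",
  "messages": [
    { "role": "user", "content": "Change the tile at (10,20)" }
  ]
}
```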
#### 3.2 Wire into Service Factory

- Update `CreateAIService()` to check for `CLAUDE_API_KEY`
- Add Claude as a provider option
#### 3.3 Testing

- Test with various prompts
- Compare output quality vs. Gemini/Ollama

### Success Criteria

- Claude service works interchangeably with the others
- Quality comparable to or better than Gemini
## Phase 4: Enhanced Prompt Engineering (3-4 hours) ✅ COMPLETE

### Implementation Tasks

#### 4.1 Create PromptBuilder Utility

- Create `src/cli/service/prompt_builder.h`
- Create `src/cli/service/prompt_builder.cc`
  - Implement `LoadResourceCatalogue()` (with hardcoded docs for now)
  - Implement `BuildSystemPrompt()` with full command docs
  - Implement `BuildFewShotExamplesSection()` with proven examples
  - Implement `BuildContextPrompt()` with ROM state foundation
- Add default few-shot examples (6+ examples)
- Add command documentation (palette, overworld, sprite, dungeon, rom)
- Add tile ID reference (tree, house, water, grass)
- Add constraints section (output format, syntax rules)
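A sketch of the resulting interface; the method names come from this checklist, the signatures are assumptions:

```cpp
// src/cli/service/prompt_builder.h -- sketch.
#pragma once

#include <string>

class PromptBuilder {
 public:
  // Loads command/tile documentation (hardcoded for now; YAML later).
  void LoadResourceCatalogue();

  // Command docs + tile ID reference + constraints section.
  std::string BuildSystemPrompt() const;

  // The 6+ proven input/output examples appended to the system prompt.
  std::string BuildFewShotExamplesSection() const;

  // ROM-state context (loaded ROM, current state); foundation only.
  std::string BuildContextPrompt() const;

  // Convenience used by the services in 4.2: system prompt + examples.
  std::string BuildSystemInstructionWithExamples() const;
};
```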
#### 4.2 Integrate into Services

- Update OllamaAIService to use PromptBuilder
  - Add PromptBuilder include
  - Add `use_enhanced_prompting` flag (default: true)
  - Use `BuildSystemInstructionWithExamples()`
- Update GeminiAIService to use PromptBuilder
  - Add PromptBuilder include
  - Add `use_enhanced_prompting` flag (default: true)
  - Use `BuildSystemInstructionWithExamples()`
- Update ClaudeAIService to use PromptBuilder (pending Phase 3)
#### 4.3 Testing

- Create test script (`test_enhanced_prompting.sh`)
- Test with complex prompts (pending real API validation)
- Measure accuracy improvement (pending validation)
- Document which models perform best (pending validation)
### Success Criteria

- PromptBuilder utility class implemented
- Few-shot examples included (6+ examples)
- Command documentation complete
- Tile ID reference included
- Integrated into Ollama & Gemini
- Enabled by default
- System prompts include the full resource catalogue (pending YAML loading)
- Few-shot examples raise accuracy above 90% (pending validation)
- Context injection provides relevant ROM info (foundation in place)

**Status**: ✅ Complete (implementation) - see `PHASE4-COMPLETE.md`
**Pending**: Real API validation to measure accuracy improvement
## Configuration & Documentation

### Environment Variables Setup

- Document `YAZE_AI_PROVIDER` options
- Document `OLLAMA_MODEL` override
- Document API key requirements
- Create example `.env` file
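A possible example `.env`; `GEMINI_API_KEY` is an assumed variable name, the others appear elsewhere in this checklist:

```bash
YAZE_AI_PROVIDER=ollama          # ollama | gemini | claude | mock
OLLAMA_MODEL=qwen2.5-coder:7b
GEMINI_MODEL=gemini-2.5-flash
GEMINI_API_KEY=your-gemini-key   # name assumed
CLAUDE_API_KEY=your-claude-key
```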
### User Documentation

- Create `docs/z3ed/AI-SERVICE-SETUP.md`
  - Ollama quick start
  - Gemini setup guide
  - Claude setup guide
  - Troubleshooting section
- Update README with LLM setup instructions
- Add examples to main docs
### CLI Enhancements

- Add `--ai-provider` flag to override the env var
- Add `--ai-model` flag to override the model
- Add `--dry-run` flag to show commands without executing them
- Add `--interactive` flag to confirm each command
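Intended usage once these flags land (illustrative):

```bash
z3ed agent run --prompt "Make all soldiers red" \
  --ai-provider ollama --ai-model qwen2.5-coder:7b \
  --dry-run
```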
## Testing Matrix

| Provider | Model | Test Prompt | Expected Commands | Status |
|---|---|---|---|---|
| Ollama | qwen2.5-coder:7b | "Validate ROM" | `["rom validate --rom zelda3.sfc"]` | ⬜ |
| Ollama | codellama:13b | "Export first palette" | `["palette export ..."]` | ⬜ |
| Gemini | gemini-2.5-flash | "Make soldiers red" | `["palette export ...", "palette set-color ...", ...]` | ⬜ |
| Claude | claude-3.5-sonnet | "Change tile at (10,20)" | `["overworld set-tile ..."]` | ⬜ |
## Rollout Plan

### Week 1 (Oct 7-11, 2025)
- Monday: Phase 1 implementation (OllamaAIService class)
- Tuesday: Phase 1 CMake + wiring
- Wednesday: Phase 1 testing + documentation
- Thursday: Phase 2 (Gemini fixes)
- Friday: Buffer day + code review
### Week 2 (Oct 14-18, 2025)
- Monday: Phase 3 (Claude integration)
- Tuesday: Phase 4 (PromptBuilder)
- Wednesday: Enhanced testing across all services
- Thursday: Documentation completion
- Friday: User validation + demos
## Known Risks & Mitigation
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Ollama not available on CI | Medium | Low | Add `YAZE_AI_PROVIDER=mock` for CI builds |
| LLM output format inconsistent | High | Medium | Strict system prompts + validation layer |
| API rate limits | Medium | Medium | Cache responses, implement retry backoff |
| Model accuracy insufficient | High | Low | Multiple few-shot examples + prompt tuning |
## Success Metrics

**Phase 1 Complete**:
- ✅ Ollama service operational on local machine
- ✅ Can generate valid z3ed commands from prompts
- ✅ End-to-end test passes
**Phase 2-3 Complete**:
- ✅ All three providers (Ollama, Gemini, Claude) work interchangeably
- ✅ Service selection transparent to user
**Phase 4 Complete**:
- ✅ Command accuracy >90% on standard prompts
- ✅ Resource catalogue integrated into system prompts
**Production Ready**:
- ✅ Documentation complete with setup guides
- ✅ Error messages are actionable
- ✅ Works on macOS (primary target)
- ✅ At least one user validates the workflow
## Next Steps After Completion

- **Gather User Feedback**: Share with the ROM hacking community
- **Measure Accuracy**: Track the success rate of generated commands
- **Model Comparison**: Document which models work best
- **Fine-Tuning**: Consider fine-tuning local models on a z3ed corpus
- **Agentic Loop**: Add self-correction based on execution results
## Notes & Observations

Add notes here as you progress through implementation:

**Last Updated**: October 3, 2025
**Next Review**: After Phase 1 completion