# LLM Integration Implementation Checklist

**Created**: October 3, 2025
**Status**: Ready to Begin
**Estimated Time**: 12-15 hours total

> 📋 **Main Guide**: See `LLM-INTEGRATION-PLAN.md` for detailed implementation instructions.
## Phase 1: Ollama Local Integration (4-6 hours) ✅ COMPLETE

### Prerequisites

- Install Ollama: `brew install ollama` (macOS)
- Start Ollama server: `ollama serve`
- Pull recommended model: `ollama pull qwen2.5-coder:7b`
- Test connectivity: `curl http://localhost:11434/api/tags`
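The same sequence, in order, for copy-paste (macOS with Homebrew):

```bash
brew install ollama                   # macOS install
ollama serve                          # start the local server
ollama pull qwen2.5-coder:7b          # fetch the recommended model
curl http://localhost:11434/api/tags  # verify the server responds
```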
### Implementation Tasks

#### 1.1 Create OllamaAIService Class

- Create `src/cli/service/ollama_ai_service.h`
  - Define `OllamaConfig` struct
  - Declare `OllamaAIService` class with `GetCommands()` override
  - Add `CheckAvailability()` and `ListAvailableModels()` methods
- Create `src/cli/service/ollama_ai_service.cc`
  - Implement constructor with config
  - Implement `BuildSystemPrompt()` with z3ed command documentation
  - Implement `CheckAvailability()` with health check
  - Implement `GetCommands()` with Ollama API call
  - Add JSON parsing for command extraction
  - Add error handling for connection failures
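A minimal sketch of what the header could look like. The `AIService` base interface is an assumption (implied by the existing MockAIService/GeminiAIService), and member defaults are illustrative:

```cpp
// src/cli/service/ollama_ai_service.h -- sketch, not the final header.
#pragma once

#include <string>
#include <utility>
#include <vector>

// Assumed shared base interface; the real declaration may differ.
class AIService {
 public:
  virtual ~AIService() = default;
  virtual std::vector<std::string> GetCommands(const std::string& prompt) = 0;
};

struct OllamaConfig {
  std::string base_url = "http://localhost:11434";  // default Ollama endpoint
  std::string model = "qwen2.5-coder:7b";
  double temperature = 0.2;  // illustrative default
};

class OllamaAIService : public AIService {
 public:
  explicit OllamaAIService(OllamaConfig config) : config_(std::move(config)) {}

  // Generates z3ed commands from a natural-language prompt via the Ollama API.
  std::vector<std::string> GetCommands(const std::string& prompt) override;

  // Health check: true if the server answers GET /api/tags.
  bool CheckAvailability();

  // Models the local server has already pulled.
  std::vector<std::string> ListAvailableModels();

 private:
  // System prompt carrying the z3ed command documentation.
  std::string BuildSystemPrompt() const;

  OllamaConfig config_;
};
```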
#### 1.2 Update CMake Configuration

- Add `YAZE_WITH_HTTPLIB` option to `CMakeLists.txt`
- Add httplib detection (vcpkg or bundled)
- Add compile definition `YAZE_WITH_HTTPLIB`
- Update z3ed target to link httplib when available
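A sketch of the intended wiring; the bundled path is an assumption, and cpp-httplib's vcpkg package exports the `httplib::httplib` target:

```cmake
option(YAZE_WITH_HTTPLIB "Build z3ed with HTTP support for LLM providers" ON)

if(YAZE_WITH_HTTPLIB)
  find_package(httplib CONFIG QUIET)          # vcpkg-provided config package
  if(NOT httplib_FOUND)
    add_subdirectory(third_party/cpp-httplib) # bundled fallback; path assumed
  endif()
  target_compile_definitions(z3ed PRIVATE YAZE_WITH_HTTPLIB)
  target_link_libraries(z3ed PRIVATE httplib::httplib)
endif()
```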
#### 1.3 Wire into Agent Commands

- Update `src/cli/handlers/agent/general_commands.cc`
  - Add `#include "cli/service/ollama_ai_service.h"`
  - Create `CreateAIService()` helper function
  - Implement provider selection logic (env vars)
  - Add health check with fallback to MockAIService
- Update `HandleRunCommand()` to use service factory
- Update `HandlePlanCommand()` to use service factory
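A sketch of the provider-selection logic described above, assuming the service classes from Phases 1-3; `GEMINI_API_KEY` and the exact fallback order are assumptions, only the other names come from this checklist:

```cpp
// general_commands.cc -- sketch of the service factory.
#include <cstdlib>
#include <memory>
#include <string>

std::unique_ptr<AIService> CreateAIService() {
  const char* provider = std::getenv("YAZE_AI_PROVIDER");
  const std::string name = provider ? provider : "mock";

  if (name == "ollama") {
    OllamaConfig config;
    if (const char* model = std::getenv("OLLAMA_MODEL")) config.model = model;
    auto service = std::make_unique<OllamaAIService>(config);
    if (service->CheckAvailability()) return service;
    // Server unreachable: fall through to the mock service below.
  } else if (name == "gemini") {
    // GEMINI_API_KEY is an assumed variable name.
    if (const char* key = std::getenv("GEMINI_API_KEY")) {
      return std::make_unique<GeminiAIService>(key);
    }
  }
  return std::make_unique<MockAIService>();  // safe default for CI and tests
}
```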
#### 1.4 Testing & Validation

- Create `scripts/test_ollama_integration.sh`
  - Check Ollama server availability
  - Verify model is pulled
  - Test `z3ed agent run` with simple prompt
  - Verify proposal creation
  - Review generated commands
- Run end-to-end test
- Document any issues encountered
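A sketch of what the script might check; the `z3ed agent run` invocation comes from the success criteria below, the rest is illustrative:

```bash
#!/usr/bin/env bash
# scripts/test_ollama_integration.sh -- sketch.
set -euo pipefail

# 1. Server availability.
curl -sf http://localhost:11434/api/tags > /dev/null \
  || { echo "Ollama server not running; try: ollama serve"; exit 1; }

# 2. Model is pulled.
ollama list | grep -q "qwen2.5-coder:7b" \
  || { echo "Model missing; try: ollama pull qwen2.5-coder:7b"; exit 1; }

# 3. End-to-end smoke test: prompt -> generated commands -> proposal.
z3ed agent run --prompt "Validate ROM"
```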
### Success Criteria

- `z3ed agent run --prompt "Validate ROM"` generates the correct command
- Health check reports clear errors when Ollama is unavailable
- Service fallback to MockAIService works correctly
- Test script passes without manual intervention

**Status**: ✅ Complete - see `PHASE1-COMPLETE.md`
## Phase 2: Improve Gemini Integration (2-3 hours) ✅ COMPLETE

### Implementation Tasks

#### 2.1 Fix GeminiAIService

- Update `src/cli/service/gemini_ai_service.h`
  - Add `GeminiConfig` struct with model, temperature, max_tokens
  - Add health check methods
  - Update constructor signature
- Update `src/cli/service/gemini_ai_service.cc`
  - Fix system instruction format (separate field in the v1beta API)
  - Update to use the `gemini-2.5-flash` model
  - Add generation config (temperature, maxOutputTokens)
  - Add `responseMimeType: application/json` for structured output
  - Implement markdown code block stripping
  - Add `CheckAvailability()` with API key validation
  - Improve error messages with actionable guidance
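For reference, the v1beta request shape these items describe, posted to `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent`; the system instruction is its own top-level field, and JSON output is requested via `generationConfig` (values illustrative):

```json
{
  "system_instruction": { "parts": [{ "text": "<z3ed command documentation>" }] },
  "contents": [{ "role": "user", "parts": [{ "text": "Make all soldiers red" }] }],
  "generationConfig": {
    "temperature": 0.2,
    "maxOutputTokens": 1024,
    "responseMimeType": "application/json"
  }
}
```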
#### 2.2 Wire into Service Factory

- Update `CreateAIService()` to use `GeminiConfig`
- Add Gemini health check with fallback
- Add `GEMINI_MODEL` environment variable support
- Test with graceful fallback
#### 2.3 Testing

- Create `scripts/test_gemini_integration.sh`
- Test graceful fallback without API key
- Test error handling (invalid key, network issues)
- Test with real API key (pending)
- Verify JSON array parsing (pending)
- Test various prompts (pending)
### Success Criteria

- Gemini service compiles and builds
- Service factory integration works
- Graceful fallback to MockAIService
- Gemini generates valid command arrays (pending API key)
- Markdown stripping works reliably (pending API key)
- Error messages guide the user to API key setup

**Status**: ✅ Complete (build & integration) - see `PHASE2-COMPLETE.md`
**Pending**: Real API key validation
## Phase 3: Add Claude Integration (2-3 hours)

### Implementation Tasks

#### 3.1 Create ClaudeAIService

- Create `src/cli/service/claude_ai_service.h`
  - Define class with API key constructor
  - Add `GetCommands()` override
- Create `src/cli/service/claude_ai_service.cc`
  - Implement Claude Messages API call
  - Use the `claude-3-5-sonnet-20241022` model
  - Add markdown stripping
  - Add error handling
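For reference, a Messages API request is a POST to `https://api.anthropic.com/v1/messages` with `x-api-key` and `anthropic-version` headers and a body of this shape (values illustrative):

```json
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "system": "<z3ed command documentation>",
  "messages": [
    { "role": "user", "content": "Change the tile at (10,20)" }
  ]
}
```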
#### 3.2 Wire into Service Factory

- Update `CreateAIService()` to check for `CLAUDE_API_KEY`
- Add Claude as a provider option
#### 3.3 Testing

- Test with various prompts
- Compare output quality vs. Gemini/Ollama

### Success Criteria

- Claude service works interchangeably with the others
- Quality comparable to or better than Gemini
## Phase 4: Enhanced Prompt Engineering (3-4 hours) ✅ COMPLETE

### Implementation Tasks

#### 4.1 Create PromptBuilder Utility

- Create `src/cli/service/prompt_builder.h`
- Create `src/cli/service/prompt_builder.cc`
  - Implement `LoadResourceCatalogue()` (with hardcoded docs for now)
  - Implement `BuildSystemPrompt()` with full command docs
  - Implement `BuildFewShotExamplesSection()` with proven examples
  - Implement `BuildContextPrompt()` with ROM state foundation
- Add default few-shot examples (6+ examples)
- Add command documentation (palette, overworld, sprite, dungeon, rom)
- Add tile ID reference (tree, house, water, grass)
- Add constraints section (output format, syntax rules)
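A sketch of the resulting interface; the method names come from this checklist, the signatures are assumptions:

```cpp
// src/cli/service/prompt_builder.h -- sketch.
#pragma once

#include <string>

class PromptBuilder {
 public:
  // Loads command/tile documentation (hardcoded for now; YAML later).
  void LoadResourceCatalogue();

  // Command docs + tile ID reference + constraints section.
  std::string BuildSystemPrompt() const;

  // The 6+ proven input/output examples appended to the system prompt.
  std::string BuildFewShotExamplesSection() const;

  // ROM-state context (loaded ROM, current state); foundation only.
  std::string BuildContextPrompt() const;

  // Convenience used by the services in 4.2: system prompt + examples.
  std::string BuildSystemInstructionWithExamples() const;
};
```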
#### 4.2 Integrate into Services

- Update OllamaAIService to use PromptBuilder
  - Add PromptBuilder include
  - Add `use_enhanced_prompting` flag (default: true)
  - Use `BuildSystemInstructionWithExamples()`
- Update GeminiAIService to use PromptBuilder
  - Add PromptBuilder include
  - Add `use_enhanced_prompting` flag (default: true)
  - Use `BuildSystemInstructionWithExamples()`
- Update ClaudeAIService to use PromptBuilder (pending Phase 3)
#### 4.3 Testing

- Create test script (`test_enhanced_prompting.sh`)
- Test with complex prompts (pending real API validation)
- Measure accuracy improvement (pending validation)
- Document which models perform best (pending validation)
### Success Criteria

- PromptBuilder utility class implemented
- Few-shot examples included (6+ examples)
- Command documentation complete
- Tile ID reference included
- Integrated into Ollama & Gemini
- Enabled by default
- System prompts include the full resource catalogue (pending YAML loading)
- Few-shot examples raise accuracy above 90% (pending validation)
- Context injection provides relevant ROM info (foundation in place)

**Status**: ✅ Complete (implementation) - see `PHASE4-COMPLETE.md`
**Pending**: Real API validation to measure accuracy improvement
## Configuration & Documentation

### Environment Variables Setup

- Document `YAZE_AI_PROVIDER` options
- Document `OLLAMA_MODEL` override
- Document API key requirements
- Create example `.env` file
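A possible example `.env`; `GEMINI_API_KEY` is an assumed variable name, the others appear elsewhere in this checklist:

```bash
YAZE_AI_PROVIDER=ollama          # ollama | gemini | claude | mock
OLLAMA_MODEL=qwen2.5-coder:7b
GEMINI_MODEL=gemini-2.5-flash
GEMINI_API_KEY=your-gemini-key   # name assumed
CLAUDE_API_KEY=your-claude-key
```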
### User Documentation

- Create `docs/z3ed/AI-SERVICE-SETUP.md`
  - Ollama quick start
  - Gemini setup guide
  - Claude setup guide
  - Troubleshooting section
- Update README with LLM setup instructions
- Add examples to main docs
### CLI Enhancements

- Add `--ai-provider` flag to override the env var
- Add `--ai-model` flag to override the model
- Add `--dry-run` flag to show commands without executing them
- Add `--interactive` flag to confirm each command
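Intended usage once these flags land (illustrative):

```bash
z3ed agent run --prompt "Make all soldiers red" \
  --ai-provider ollama --ai-model qwen2.5-coder:7b \
  --dry-run
```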
## Testing Matrix

| Provider | Model | Test Prompt | Expected Commands | Status |
|---|---|---|---|---|
| Ollama | qwen2.5-coder:7b | "Validate ROM" | `["rom validate --rom zelda3.sfc"]` | ⬜ |
| Ollama | codellama:13b | "Export first palette" | `["palette export ..."]` | ⬜ |
| Gemini | gemini-2.5-flash | "Make soldiers red" | `["palette export ...", "palette set-color ...", ...]` | ⬜ |
| Claude | claude-3.5-sonnet | "Change tile at (10,20)" | `["overworld set-tile ..."]` | ⬜ |
## Rollout Plan

### Week 1 (Oct 7-11, 2025)
- Monday: Phase 1 implementation (OllamaAIService class)
- Tuesday: Phase 1 CMake + wiring
- Wednesday: Phase 1 testing + documentation
- Thursday: Phase 2 (Gemini fixes)
- Friday: Buffer day + code review
### Week 2 (Oct 14-18, 2025)
- Monday: Phase 3 (Claude integration)
- Tuesday: Phase 4 (PromptBuilder)
- Wednesday: Enhanced testing across all services
- Thursday: Documentation completion
- Friday: User validation + demos
## Known Risks & Mitigation
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Ollama not available on CI | Medium | Low | Add `YAZE_AI_PROVIDER=mock` for CI builds |
| LLM output format inconsistent | High | Medium | Strict system prompts + validation layer |
| API rate limits | Medium | Medium | Cache responses, implement retry backoff |
| Model accuracy insufficient | High | Low | Multiple few-shot examples + prompt tuning |
## Success Metrics

**Phase 1 Complete**:
- ✅ Ollama service operational on local machine
- ✅ Can generate valid z3ed commands from prompts
- ✅ End-to-end test passes
**Phase 2-3 Complete**:
- ✅ All three providers (Ollama, Gemini, Claude) work interchangeably
- ✅ Service selection transparent to user
**Phase 4 Complete**:
- ✅ Command accuracy >90% on standard prompts
- ✅ Resource catalogue integrated into system prompts
**Production Ready**:
- ✅ Documentation complete with setup guides
- ✅ Error messages are actionable
- ✅ Works on macOS (primary target)
- ✅ At least one user validates the workflow
## Next Steps After Completion

- **Gather User Feedback**: Share with the ROM hacking community
- **Measure Accuracy**: Track the success rate of generated commands
- **Model Comparison**: Document which models work best
- **Fine-Tuning**: Consider fine-tuning local models on a z3ed corpus
- **Agentic Loop**: Add self-correction based on execution results
## Notes & Observations

Add notes here as you progress through implementation:

**Last Updated**: October 3, 2025
**Next Review**: After Phase 1 completion