# CI/CD Improvements Proposal ## Executive Summary This document proposes specific improvements to the YAZE CI/CD pipeline to catch build failures earlier, reduce wasted CI time, and provide faster feedback to developers. **Goals**: - Reduce time-to-first-failure from ~15 minutes to <5 minutes - Catch 90% of failures in fast jobs (<5 min) - Reduce PR iteration time from hours to minutes - Prevent platform-specific issues from reaching CI **ROI**: - **Time Saved**: ~10 minutes per failed build × ~30 failures/month = **5 hours/month** - **Developer Experience**: Faster feedback → less context switching - **CI Cost**: Minimal (fast jobs use fewer resources) --- ## Current CI Pipeline Analysis ### Current Jobs | Job | Platform | Duration | Cost | Catches | |-----|----------|----------|------|---------| | build | Ubuntu/macOS/Windows | 15-20 min | High | Compilation errors | | test | Ubuntu/macOS/Windows | 5 min | Medium | Test failures | | windows-agent | Windows | 30 min | High | AI stack issues | | code-quality | Ubuntu | 2 min | Low | Format/lint issues | | memory-sanitizer | Ubuntu | 20 min | High | Memory bugs | | z3ed-agent-test | macOS | 15 min | High | Agent integration | **Total PR Time**: ~40 minutes (parallel), ~90 minutes (worst case) ### Issues with Current Pipeline 1. **Long feedback loop**: 15-20 minutes to find out if headers are missing 2. **Wasted resources**: Full 20-minute builds that fail in first 2 minutes 3. **No early validation**: CMake configuration succeeds, but compilation fails later 4. **Symbol conflicts detected late**: Link errors only appear after full compile 5. **Platform-specific issues**: Discovered after 15+ minutes per platform --- ## Proposed Improvements ### Improvement 1: Configuration Validation Job **Goal**: Catch CMake errors in <2 minutes **Implementation**: ```yaml config-validation: name: "Config Validation - ${{ matrix.preset }}" runs-on: ${{ matrix.os }} strategy: fail-fast: true # Stop immediately if any fails matrix: include: - os: ubuntu-22.04 preset: ci-linux - os: macos-14 preset: ci-macos - os: windows-2022 preset: ci-windows steps: - uses: actions/checkout@v4 with: submodules: recursive - name: Setup build environment uses: ./.github/actions/setup-build with: platform: ${{ matrix.platform }} preset: ${{ matrix.preset }} - name: Validate CMake configuration run: | cmake --preset ${{ matrix.preset }} \ -DCMAKE_VERBOSE_MAKEFILE=OFF - name: Check include paths run: | grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt || \ (echo "Include paths not configured" && exit 1) - name: Validate presets run: cmake --preset ${{ matrix.preset }} --list-presets ``` **Benefits**: - ✅ Fails in <2 minutes for CMake errors - ✅ Catches missing dependencies immediately - ✅ Validates include path propagation - ✅ Low resource usage (no compilation) **What it catches**: - CMake syntax errors - Missing dependencies (immediate) - Invalid preset definitions - Include path misconfiguration --- ### Improvement 2: Compile-Only Job **Goal**: Catch compilation errors in <5 minutes **Implementation**: ```yaml compile-check: name: "Compile Check - ${{ matrix.preset }}" runs-on: ${{ matrix.os }} needs: [config-validation] # Run after config validation passes strategy: fail-fast: false matrix: include: - os: ubuntu-22.04 preset: ci-linux platform: linux - os: macos-14 preset: ci-macos platform: macos - os: windows-2022 preset: ci-windows platform: windows steps: - uses: actions/checkout@v4 with: submodules: recursive - name: Setup build environment uses: ./.github/actions/setup-build with: platform: ${{ matrix.platform }} preset: ${{ matrix.preset }} - name: Configure project run: cmake --preset ${{ matrix.preset }} - name: Compile representative files run: | # Compile 10-20 key files to catch most header issues cmake --build build --target rom.cc.o bitmap.cc.o \ overworld.cc.o resource_catalog.cc.o \ dungeon.cc.o sprite.cc.o palette.cc.o \ asar_wrapper.cc.o controller.cc.o canvas.cc.o \ --parallel 4 - name: Check for common issues run: | # Platform-specific checks if [ "${{ matrix.platform }}" = "windows" ]; then echo "Checking for /std:c++latest flag..." grep "std:c++latest" build/compile_commands.json || \ echo "Warning: C++20 flag may be missing" fi ``` **Benefits**: - ✅ Catches header issues in ~5 minutes - ✅ Tests actual compilation without full build - ✅ Platform-specific early detection - ✅ ~70% faster than full build **What it catches**: - Missing headers - Include path problems - Preprocessor errors - Template instantiation issues - Platform-specific compilation errors --- ### Improvement 3: Symbol Conflict Job **Goal**: Detect ODR violations before linking **Implementation**: ```yaml symbol-check: name: "Symbol Check - ${{ matrix.platform }}" runs-on: ${{ matrix.os }} needs: [build] # Run after full build completes strategy: matrix: include: - os: ubuntu-22.04 platform: linux - os: macos-14 platform: macos - os: windows-2022 platform: windows steps: - uses: actions/checkout@v4 - name: Download build artifacts uses: actions/download-artifact@v4 with: name: build-${{ matrix.platform }} path: build - name: Check for symbol conflicts (Unix) if: matrix.platform != 'windows' run: ./scripts/verify-symbols.sh --build-dir build - name: Check for symbol conflicts (Windows) if: matrix.platform == 'windows' shell: pwsh run: .\scripts\verify-symbols.ps1 -BuildDir build - name: Upload conflict report if: failure() uses: actions/upload-artifact@v4 with: name: symbol-conflicts-${{ matrix.platform }} path: build/symbol-report.txt ``` **Benefits**: - ✅ Catches ODR violations before linking - ✅ Detects FLAGS conflicts (Linux-specific) - ✅ Platform-specific symbol issues - ✅ Runs in parallel with tests (~3 minutes) **What it catches**: - Duplicate symbol definitions - FLAGS_* conflicts (gflags) - ODR violations - Link-time errors (predicted) --- ### Improvement 4: Fail-Fast Strategy **Goal**: Stop wasting resources on doomed builds **Current Behavior**: All jobs run even if one fails **Proposed Behavior**: Stop non-essential jobs if critical jobs fail **Implementation**: ```yaml jobs: # Critical path: These must pass config-validation: # ... (as above) compile-check: needs: [config-validation] strategy: fail-fast: true # Stop all platforms if one fails build: needs: [compile-check] strategy: fail-fast: false # Allow other platforms to continue # Non-critical: These can be skipped if builds fail integration-tests: needs: [build] if: success() # Only run if build succeeded windows-agent: needs: [build, test] if: success() && github.event_name != 'pull_request' ``` **Benefits**: - ✅ Saves ~60 minutes of CI time per failed build - ✅ Faster feedback (no waiting for doomed jobs) - ✅ Reduced resource usage --- ### Improvement 5: Preset Matrix Testing **Goal**: Validate all presets can configure **Implementation**: ```yaml preset-validation: name: "Preset Validation" runs-on: ${{ matrix.os }} strategy: matrix: os: [ubuntu-22.04, macos-14, windows-2022] steps: - uses: actions/checkout@v4 - name: Test all presets for platform run: | for preset in $(cmake --list-presets | grep ${{ matrix.os }} | awk '{print $1}'); do echo "Testing preset: $preset" cmake --preset "$preset" --list-presets || exit 1 done ``` **Benefits**: - ✅ Catches invalid preset definitions - ✅ Validates CMake configuration across all presets - ✅ Fast (<2 minutes) --- ## Proposed CI Pipeline (New) ### Job Dependencies ``` ┌─────────────────────┐ │ config-validation │ (2 min, fail-fast) └──────────┬──────────┘ │ ▼ ┌─────────────────────┐ │ compile-check │ (5 min, fail-fast) └──────────┬──────────┘ │ ▼ ┌─────────────────────┐ │ build │ (15 min, parallel) └──────────┬──────────┘ │ ├──────────┬──────────┬──────────┐ ▼ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ test │ │ symbol │ │quality │ │sanitize│ │ (5 min)│ │(3 min) │ │(2 min) │ │(20 min)│ └────────┘ └────────┘ └────────┘ └────────┘ ``` ### Time Comparison **Current Pipeline**: - First failure: ~15 minutes (compilation error) - Total time: ~40 minutes (if all succeed) **Proposed Pipeline**: - First failure: ~2 minutes (CMake error) or ~5 minutes (compilation error) - Total time: ~40 minutes (if all succeed) **Time Saved**: - CMake errors: **13 minutes saved** (15 min → 2 min) - Compilation errors: **10 minutes saved** (15 min → 5 min) - Symbol conflicts: **Caught earlier** (no failed PRs) --- ## Implementation Plan ### Phase 1: Quick Wins (Week 1) 1. **Add config-validation job** - Copy composite actions - Add new job to `ci.yml` - Test on feature branch 2. **Add symbol-check script** - Already created: `scripts/verify-symbols.sh` - Add Windows version: `scripts/verify-symbols.ps1` - Test locally 3. **Update job dependencies** - Make `build` depend on `config-validation` - Add fail-fast to compile-check **Deliverables**: - ✅ Config validation catches CMake errors in <2 min - ✅ Symbol checker available for CI - ✅ Fail-fast prevents wasted CI time ### Phase 2: Compilation Checks (Week 2) 1. **Add compile-check job** - Identify representative files - Create compilation target list - Add to CI workflow 2. **Platform-specific smoke tests** - Windows: Check `/std:c++latest` - Linux: Check `-std=c++20` - macOS: Check framework links **Deliverables**: - ✅ Compilation errors caught in <5 min - ✅ Platform-specific issues detected early ### Phase 3: Symbol Validation (Week 3) 1. **Add symbol-check job** - Integrate `verify-symbols.sh` - Upload conflict reports - Add to required checks 2. **Create symbol conflict guide** - Document common issues - Provide fix examples - Link from CI failures **Deliverables**: - ✅ ODR violations caught before merge - ✅ FLAGS conflicts detected automatically ### Phase 4: Optimization (Week 4) 1. **Fine-tune fail-fast** - Identify critical vs optional jobs - Set up conditional execution - Test resource savings 2. **Add caching improvements** - Cache compiled objects - Share artifacts between jobs - Optimize dependency downloads **Deliverables**: - ✅ ~60 minutes CI time saved per failed build - ✅ Faster PR iteration --- ## Success Metrics ### Before Improvements | Metric | Value | |--------|-------| | Time to first failure | 15-20 min | | CI failures per month | ~30 | | Wasted CI time/month | ~8 hours | | PR iteration time | 2-4 hours | | Symbol conflicts caught | 0% (manual) | ### After Improvements (Target) | Metric | Value | |--------|-------| | Time to first failure | **2-5 min** | | CI failures per month | **<10** | | Wasted CI time/month | **<2 hours** | | PR iteration time | **30-60 min** | | Symbol conflicts caught | **100%** | ### ROI Calculation **Time Savings**: - 20 failures/month × 10 min saved = **200 minutes/month** - 10 failed PRs avoided = **~4 hours/month** - **Total: ~5-6 hours/month saved** **Developer Experience**: - Faster feedback → less context switching - Earlier error detection → easier debugging - Fewer CI failures → less frustration --- ## Risks & Mitigations ### Risk 1: False Positives **Risk**: New checks catch issues that aren't real problems **Mitigation**: - Test thoroughly before enabling as required - Allow overrides for known false positives - Iterate on filtering logic ### Risk 2: Increased Complexity **Risk**: More jobs = harder to understand CI failures **Mitigation**: - Clear job names and descriptions - Good error messages with links to docs - Dependency graph visualization ### Risk 3: Slower PR Merges **Risk**: More required checks = slower to merge **Mitigation**: - Make only critical checks required - Run expensive checks post-merge - Provide override mechanism for emergencies --- ## Alternative Approaches Considered ### Approach 1: Pre-commit Hooks **Pros**: Catch issues before pushing **Cons**: Developers can skip, not enforced **Decision**: Provide optional hooks, but rely on CI ### Approach 2: GitHub Actions Matrix Expansion **Pros**: Test more combinations **Cons**: Significantly more CI time **Decision**: Focus on critical paths, expand later if needed ### Approach 3: Self-Hosted Runners **Pros**: Faster builds, more control **Cons**: Maintenance overhead, security concerns **Decision**: Stick with GitHub runners for now --- ## Related Work ### Similar Implementations - **LLVM Project**: Uses compile-only jobs for fast feedback - **Chromium**: Extensive smoke testing before full builds - **Abseil**: Symbol conflict detection in CI ### Best Practices 1. **Fail Fast**: Stop early if critical checks fail 2. **Layered Testing**: Quick checks first, expensive checks later 3. **Clear Feedback**: Good error messages with actionable advice 4. **Caching**: Reuse work across jobs when possible --- ## Appendix A: New CI Jobs (YAML) ### Config Validation Job ```yaml config-validation: name: "Config Validation - ${{ matrix.name }}" runs-on: ${{ matrix.os }} strategy: fail-fast: true matrix: include: - name: "Ubuntu 22.04" os: ubuntu-22.04 preset: ci-linux platform: linux - name: "macOS 14" os: macos-14 preset: ci-macos platform: macos - name: "Windows 2022" os: windows-2022 preset: ci-windows platform: windows steps: - name: Checkout code uses: actions/checkout@v4 with: submodules: recursive - name: Setup build environment uses: ./.github/actions/setup-build with: platform: ${{ matrix.platform }} preset: ${{ matrix.preset }} - name: Validate CMake configuration run: cmake --preset ${{ matrix.preset }} - name: Check configuration shell: bash run: | # Check include paths grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt # Check preset is valid cmake --preset ${{ matrix.preset }} --list-presets ``` ### Compile Check Job ```yaml compile-check: name: "Compile Check - ${{ matrix.name }}" runs-on: ${{ matrix.os }} needs: [config-validation] strategy: fail-fast: true matrix: include: - name: "Ubuntu 22.04" os: ubuntu-22.04 preset: ci-linux platform: linux - name: "macOS 14" os: macos-14 preset: ci-macos platform: macos - name: "Windows 2022" os: windows-2022 preset: ci-windows platform: windows steps: - name: Checkout code uses: actions/checkout@v4 with: submodules: recursive - name: Setup build environment uses: ./.github/actions/setup-build with: platform: ${{ matrix.platform }} preset: ${{ matrix.preset }} - name: Configure project run: cmake --preset ${{ matrix.preset }} - name: Smoke compilation test shell: bash run: ./scripts/pre-push-test.sh --smoke-only --preset ${{ matrix.preset }} ``` ### Symbol Check Job ```yaml symbol-check: name: "Symbol Check - ${{ matrix.name }}" runs-on: ${{ matrix.os }} needs: [build] strategy: matrix: include: - name: "Ubuntu 22.04" os: ubuntu-22.04 platform: linux - name: "macOS 14" os: macos-14 platform: macos steps: - name: Checkout code uses: actions/checkout@v4 - name: Download build artifacts uses: actions/download-artifact@v4 with: name: build-${{ matrix.platform }} path: build - name: Check for symbol conflicts shell: bash run: ./scripts/verify-symbols.sh --build-dir build - name: Upload conflict report if: failure() uses: actions/upload-artifact@v4 with: name: symbol-conflicts-${{ matrix.platform }} path: build/symbol-report.txt ``` --- ## Appendix B: Cost Analysis ### Current Monthly CI Usage (Estimated) | Job | Duration | Runs/Month | Total Time | |-----|----------|------------|------------| | build (3 platforms) | 15 min × 3 | 100 PRs | **75 hours** | | test (3 platforms) | 5 min × 3 | 100 PRs | **25 hours** | | windows-agent | 30 min | 30 | **15 hours** | | code-quality | 2 min | 100 PRs | **3.3 hours** | | memory-sanitizer | 20 min | 50 PRs | **16.7 hours** | | z3ed-agent-test | 15 min | 30 | **7.5 hours** | | **Total** | | | **142.5 hours** | ### Proposed Monthly CI Usage | Job | Duration | Runs/Month | Total Time | |-----|----------|------------|------------| | config-validation (3) | 2 min × 3 | 100 PRs | **10 hours** | | compile-check (3) | 5 min × 3 | 100 PRs | **25 hours** | | build (3 platforms) | 15 min × 3 | 80 PRs | **60 hours** (↓20%) | | test (3 platforms) | 5 min × 3 | 80 PRs | **20 hours** (↓20%) | | symbol-check (2) | 3 min × 2 | 80 PRs | **8 hours** | | windows-agent | 30 min | 25 | **12.5 hours** (↓17%) | | code-quality | 2 min | 100 PRs | **3.3 hours** | | memory-sanitizer | 20 min | 40 PRs | **13.3 hours** (↓20%) | | z3ed-agent-test | 15 min | 25 | **6.25 hours** (↓17%) | | **Total** | | | **158.4 hours** (+11%) | **Net Change**: +16 hours/month (11% increase) **BUT**: - Fewer failed builds (20% reduction) - Faster feedback (10-15 min saved per failure) - Better developer experience (invaluable) **Conclusion**: Slight increase in total CI time, but significant improvement in efficiency and developer experience