yaze/docs/internal/testing/ci-improvements-proposal.md

# CI/CD Improvements Proposal

## Executive Summary

This document proposes specific improvements to the YAZE CI/CD pipeline to catch build failures earlier, reduce wasted CI time, and provide faster feedback to developers.

**Goals**:
- Reduce time-to-first-failure from ~15 minutes to <5 minutes
- Catch 90% of failures in fast jobs (<5 min)
- Reduce PR iteration time from hours to minutes
- Prevent platform-specific issues from reaching CI

**ROI**:
- **Time Saved**: ~10 minutes per failed build × ~30 failures/month = **5 hours/month**
- **Developer Experience**: Faster feedback → less context switching
- **CI Cost**: Roughly 11% more total CI time (see Appendix B); the added jobs are short and use few resources

---

## Current CI Pipeline Analysis

### Current Jobs

| Job | Platform | Duration | Cost | Catches |
|-----|----------|----------|------|---------|
| build | Ubuntu/macOS/Windows | 15-20 min | High | Compilation errors |
| test | Ubuntu/macOS/Windows | 5 min | Medium | Test failures |
| windows-agent | Windows | 30 min | High | AI stack issues |
| code-quality | Ubuntu | 2 min | Low | Format/lint issues |
| memory-sanitizer | Ubuntu | 20 min | High | Memory bugs |
| z3ed-agent-test | macOS | 15 min | High | Agent integration |

**Total PR Time**: ~40 minutes (parallel), ~90 minutes (worst case)

### Issues with Current Pipeline

1. **Long feedback loop**: 15-20 minutes to find out if headers are missing
2. **Wasted resources**: Full 20-minute builds that fail in first 2 minutes
3. **No early validation**: CMake configuration succeeds, but compilation fails later
4. **Symbol conflicts detected late**: Link errors only appear after full compile
5. **Platform-specific issues**: Discovered after 15+ minutes per platform

---

## Proposed Improvements

### Improvement 1: Configuration Validation Job

**Goal**: Catch CMake errors in <2 minutes

**Implementation**:
```yaml
config-validation:
  name: "Config Validation - ${{ matrix.preset }}"
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: true  # Stop immediately if any fails
    matrix:
      include:
        - os: ubuntu-22.04
          preset: ci-linux
          platform: linux      # consumed by the setup-build step below
        - os: macos-14
          preset: ci-macos
          platform: macos
        - os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Validate CMake configuration
      shell: bash  # backslash line continuation needs bash on Windows runners
      run: |
        cmake --preset ${{ matrix.preset }} \
          -DCMAKE_VERBOSE_MAKEFILE=OFF

    - name: Check include paths
      shell: bash
      run: |
        grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt || \
          (echo "Include paths not configured" && exit 1)

    - name: Validate presets
      run: cmake --list-presets  # fails if the presets file cannot be parsed
```

**Benefits**:
- ✅ Fails in <2 minutes for CMake errors
- ✅ Catches missing dependencies immediately
- ✅ Validates include path propagation
- ✅ Low resource usage (no compilation)

**What it catches**:
- CMake syntax errors
- Missing dependencies (immediate)
- Invalid preset definitions
- Include path misconfiguration
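
The include-path check can also be done without `grep`, which makes it easier to require several cache variables at once. The following is a minimal sketch in Python, not part of the existing tooling: it relies only on the `NAME:TYPE=VALUE` layout of `CMakeCache.txt`, and the variable names in the sample (`SDL2_INCLUDE_DIR`, `ABSL_INCLUDE_DIR`, `PNG_INCLUDE_DIR`) are hypothetical placeholders for whatever the project actually caches.

```python
import re

# CMakeCache.txt entries look like NAME:TYPE=VALUE.
# Lines starting with "#" or "//" are comments.
_ENTRY = re.compile(r"^([^#/:=][^:=]*):([A-Z]+)=(.*)$")


def parse_cmake_cache(text):
    """Return {name: value} for every entry in a CMakeCache.txt body."""
    entries = {}
    for line in text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            name, _type, value = match.groups()
            entries[name] = value
    return entries


def missing_entries(text, required):
    """Return the required cache variables that are absent or empty."""
    cache = parse_cmake_cache(text)
    return [name for name in required if not cache.get(name)]


# Hypothetical cache content, for illustration only.
SAMPLE_CACHE = """\
# This is the CMakeCache file.
//Choose the type of build
CMAKE_BUILD_TYPE:STRING=Release
SDL2_INCLUDE_DIR:PATH=/usr/include/SDL2
ABSL_INCLUDE_DIR:PATH=
"""

required = ["CMAKE_BUILD_TYPE", "SDL2_INCLUDE_DIR", "ABSL_INCLUDE_DIR", "PNG_INCLUDE_DIR"]
print(missing_entries(SAMPLE_CACHE, required))
# -> ['ABSL_INCLUDE_DIR', 'PNG_INCLUDE_DIR']
```

A CI step could read `build/CMakeCache.txt`, call `missing_entries`, and exit non-zero when the returned list is not empty.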

---

### Improvement 2: Compile-Only Job

**Goal**: Catch compilation errors in <5 minutes

**Implementation**:
```yaml
compile-check:
  name: "Compile Check - ${{ matrix.preset }}"
  runs-on: ${{ matrix.os }}
  needs: [config-validation]  # Run after config validation passes
  strategy:
    fail-fast: true  # Matches the fail-fast strategy in Improvement 4
    matrix:
      include:
        - os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - os: macos-14
          preset: ci-macos
          platform: macos
        - os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Configure project
      run: cmake --preset ${{ matrix.preset }}

    - name: Compile representative files
      shell: bash  # line continuations below need bash on Windows runners
      run: |
        # Compile 10-20 key files to catch most header issues.
        # Object-file target names depend on the generator and source layout;
        # adjust the names below to match the configured build tree.
        cmake --build build --target rom.cc.o bitmap.cc.o \
          overworld.cc.o resource_catalog.cc.o \
          dungeon.cc.o sprite.cc.o palette.cc.o \
          asar_wrapper.cc.o controller.cc.o canvas.cc.o \
          --parallel 4

    - name: Check for common issues
      shell: bash  # the test below is bash syntax
      run: |
        # Platform-specific checks
        if [ "${{ matrix.platform }}" = "windows" ]; then
          echo "Checking for /std:c++latest flag..."
          grep -q "std:c++latest" build/compile_commands.json || \
            echo "Warning: C++20 flag may be missing"
        fi
```

**Benefits**:
- ✅ Catches header issues in ~5 minutes
- ✅ Tests actual compilation without full build
- ✅ Platform-specific early detection
- ✅ ~70% faster than full build

**What it catches**:
- Missing headers
- Include path problems
- Preprocessor errors
- Template instantiation issues
- Platform-specific compilation errors
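
The language-standard check can be made per file instead of per build tree. The sketch below is one possible approach, assuming the configure step emits a `compile_commands.json` in the standard compilation-database layout (each entry has a `file` plus either a `command` string or an `arguments` list). The sample paths are hypothetical, and the expected flag is passed in by the caller because it differs per platform.

```python
import json


def files_missing_flag(compile_commands_json, flag):
    """Return the source files whose compile command lacks `flag`.

    `flag` is matched as a substring of each argument, like a grep check,
    so "std:c++latest" matches both "/std:c++latest" and "-std:c++latest".
    """
    missing = []
    for entry in json.loads(compile_commands_json):
        # An entry carries either an "arguments" list or a "command" string.
        args = entry.get("arguments") or entry.get("command", "").split()
        if not any(flag in arg for arg in args):
            missing.append(entry["file"])
    return missing


# Hypothetical two-entry database, for illustration only.
SAMPLE = json.dumps([
    {"directory": "/b", "file": "src/rom.cc",
     "command": "c++ -std=c++20 -c src/rom.cc -o rom.cc.o"},
    {"directory": "/b", "file": "src/bitmap.cc",
     "arguments": ["c++", "-c", "src/bitmap.cc"]},
])

print(files_missing_flag(SAMPLE, "-std=c++20"))
# -> ['src/bitmap.cc']
```

Reporting the offending files, not only a warning, points directly at the target whose compile options did not propagate.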

---

### Improvement 3: Symbol Conflict Job

**Goal**: Detect ODR violations and symbol conflicts before merge

**Implementation**:
```yaml
symbol-check:
  name: "Symbol Check - ${{ matrix.platform }}"
  runs-on: ${{ matrix.os }}
  needs: [build]  # Run after full build completes
  strategy:
    matrix:
      include:
        - os: ubuntu-22.04
          platform: linux
        - os: macos-14
          platform: macos
        - os: windows-2022
          platform: windows

  steps:
    - uses: actions/checkout@v4

    - name: Download build artifacts
      uses: actions/download-artifact@v4
      with:
        name: build-${{ matrix.platform }}
        path: build

    - name: Check for symbol conflicts (Unix)
      if: matrix.platform != 'windows'
      run: ./scripts/verify-symbols.sh --build-dir build

    - name: Check for symbol conflicts (Windows)
      if: matrix.platform == 'windows'
      shell: pwsh
      run: .\scripts\verify-symbols.ps1 -BuildDir build

    - name: Upload conflict report
      if: failure()
      uses: actions/upload-artifact@v4
      with:
        name: symbol-conflicts-${{ matrix.platform }}
        path: build/symbol-report.txt
```

**Benefits**:
- ✅ Catches ODR violations before merge
- ✅ Detects FLAGS conflicts (Linux-specific)
- ✅ Platform-specific symbol issues
- ✅ Runs in parallel with tests (~3 minutes)

**What it catches**:
- Duplicate symbol definitions
- FLAGS_* conflicts (gflags)
- ODR violations
- Link-time errors (predicted)
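
The contents of `verify-symbols.sh` are not shown in this proposal, so the following is only a sketch of the kind of check such a script could perform, written in Python for clarity. It assumes the default output of `nm` (one `<address> <type> <name>` line per defined symbol, with mangled names, i.e. no `-C`) and treats only strong global definitions as conflicts. The symbol `FLAGS_rom_file` and the object file names are hypothetical.

```python
from collections import defaultdict

# nm type letters for strong global definitions: text, data, BSS, read-only.
# Weak symbols (W, V), which inline functions and templates produce, and
# undefined references (U) are ignored.
STRONG_TYPES = {"T", "D", "B", "R"}


def defined_symbols(nm_output):
    """Yield strong global symbols from the default output of `nm`."""
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as "<address> <type> <name>".
        if len(parts) == 3 and parts[1] in STRONG_TYPES:
            yield parts[2]


def find_conflicts(nm_by_object):
    """Map each symbol defined in more than one object file to those files."""
    owners = defaultdict(list)
    for obj, nm_output in nm_by_object.items():
        for symbol in set(defined_symbols(nm_output)):
            owners[symbol].append(obj)
    return {sym: sorted(objs) for sym, objs in owners.items() if len(objs) > 1}


# Hypothetical nm output for three object files.
SAMPLE = {
    "a.o": "0000000000000000 T main\n"
           "0000000000000010 B FLAGS_rom_file\n"
           "                 U printf\n",
    "b.o": "0000000000000000 B FLAGS_rom_file\n"
           "0000000000000000 W _ZN3FooC1Ev\n",
    "c.o": "0000000000000000 W _ZN3FooC1Ev\n",
}

print(find_conflicts(SAMPLE))
# -> {'FLAGS_rom_file': ['a.o', 'b.o']}
```

Ignoring weak symbols keeps ordinary inline functions and template instantiations from being reported, which is one way to limit false positives.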

---

### Improvement 4: Fail-Fast Strategy

**Goal**: Stop wasting resources on doomed builds

- **Current Behavior**: All jobs run even if one fails
- **Proposed Behavior**: Stop non-essential jobs if critical jobs fail

**Implementation**:
```yaml
jobs:
  # Critical path: These must pass
  config-validation:
    # ... (as above)

  compile-check:
    needs: [config-validation]
    strategy:
      fail-fast: true  # Stop all platforms if one fails

  build:
    needs: [compile-check]
    strategy:
      fail-fast: false  # Allow other platforms to continue

  # Non-critical: These can be skipped if builds fail
  integration-tests:
    needs: [build]
    if: success()  # Only run if build succeeded

  windows-agent:
    needs: [build, test]
    if: success() && github.event_name != 'pull_request'
```

**Benefits**:
- ✅ Saves ~60 minutes of CI time per failed build
- ✅ Faster feedback (no waiting for doomed jobs)
- ✅ Reduced resource usage
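
GitHub Actions skips a job when any job in its `needs` list fails or is skipped, unless the job overrides this with its own `if:` condition. The small Python model below illustrates how one failure propagates through the `needs` edges of the workflow above. It is a sketch for reasoning about the graph, not something the workflow runs, and it assumes that `test` needs `build`, which the snippet does not state.

```python
def skipped_after_failure(needs, failed_job):
    """Return the jobs that never start because they depend on `failed_job`."""
    skipped = set()
    changed = True
    while changed:
        changed = False
        for job, deps in needs.items():
            if job in skipped or job == failed_job:
                continue
            # A job is skipped if any dependency failed or was itself skipped.
            if any(dep == failed_job or dep in skipped for dep in deps):
                skipped.add(job)
                changed = True
    return skipped


# `needs` edges from the workflow above; the edge test -> build is an assumption.
NEEDS = {
    "config-validation": [],
    "compile-check": ["config-validation"],
    "build": ["compile-check"],
    "test": ["build"],
    "integration-tests": ["build"],
    "windows-agent": ["build", "test"],
}

print(sorted(skipped_after_failure(NEEDS, "compile-check")))
# -> ['build', 'integration-tests', 'test', 'windows-agent']
```

A failure in `compile-check` therefore prevents every expensive job from starting, which is where the resource savings come from.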

---

### Improvement 5: Preset Matrix Testing

**Goal**: Validate all presets can configure

**Implementation**:
```yaml
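preset-validation:
  name: "Preset Validation - ${{ matrix.platform }}"
  runs-on: ${{ matrix.os }}
  strategy:
    matrix:
      include:
        - os: ubuntu-22.04
          platform: linux
        - os: macos-14
          platform: macos
        - os: windows-2022
          platform: windows

  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive

    # Configuring may also need the setup-build step used by config-validation.
    - name: Test all presets for platform
      shell: bash  # the loop is bash syntax; Windows runners default to pwsh
      run: |
        # Assumes preset names contain the platform keyword (ci-linux, ci-macos, ...).
        # `cmake --list-presets` prints names in double quotes, so cut strips them.
        presets=$(cmake --list-presets | grep "${{ matrix.platform }}" | cut -d'"' -f2 || true)
        [ -n "$presets" ] || { echo "No presets matched ${{ matrix.platform }}"; exit 1; }
        for preset in $presets; do
          echo "Testing preset: $preset"
          cmake --preset "$preset" || exit 1
        done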
```

**Benefits**:
- ✅ Catches invalid preset definitions
- ✅ Validates CMake configuration across all presets
- ✅ Fast (<2 minutes)
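
Selecting presets per platform depends on parsing the text that `cmake --list-presets` prints, where each preset appears on an indented line with its name in double quotes. The Python sketch below shows that parsing step in isolation. It assumes preset names contain a platform keyword, as `ci-linux`, `ci-macos`, and `ci-windows` do; the display names in the sample output are hypothetical.

```python
import re


def preset_names(list_presets_output):
    """Extract preset names from `cmake --list-presets` output."""
    # Each preset is printed on an indented line with its name in double quotes.
    return re.findall(r'^[ \t]+"([^"]+)"', list_presets_output, flags=re.MULTILINE)


def presets_for_platform(list_presets_output, keyword):
    """Return the presets whose name contains the platform keyword."""
    return [name for name in preset_names(list_presets_output) if keyword in name]


# Sample output; the display names after the dash are made up.
SAMPLE_OUTPUT = '''Available configure presets:

  "ci-linux"   - CI Linux
  "ci-macos"   - CI macOS
  "ci-windows" - CI Windows
'''

print(presets_for_platform(SAMPLE_OUTPUT, "linux"))
# -> ['ci-linux']
```

Matching on the preset name, not on the runner label, matters here: a label such as `ubuntu-22.04` appears in none of the preset names, so a filter based on it would select nothing and the job would pass without testing anything.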

---

## Proposed CI Pipeline (New)

### Job Dependencies

```
┌─────────────────────┐
│ config-validation   │ (2 min, fail-fast)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  compile-check      │ (5 min, fail-fast)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│       build         │ (15 min, parallel)
└──────────┬──────────┘
           │
           ├──────────┬──────────┬──────────┐
           ▼          ▼          ▼          ▼
      ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
      │  test  │ │ symbol │ │quality │ │sanitize│
      │ (5 min)│ │(3 min) │ │(2 min) │ │(20 min)│
      └────────┘ └────────┘ └────────┘ └────────┘
```

### Time Comparison

**Current Pipeline**:
- First failure: ~15 minutes (compilation error)
- Total time: ~40 minutes (if all succeed)

**Proposed Pipeline**:
- First failure: ~2 minutes (CMake error) or ~5 minutes (compilation error)
- Total time: ~40 minutes (if all succeed)

**Time Saved**:
- CMake errors: **13 minutes saved** (15 min → 2 min)
- Compilation errors: **10 minutes saved** (15 min → 5 min)
- Symbol conflicts: **Caught earlier** (no failed PRs)
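
The per-job durations in the diagram can be combined into a rough end-to-end estimate. The Python sketch below assumes every job starts as soon as its `needs` have finished and that a runner is always available; the job names, durations, and edges are copied from the dependency diagram.

```python
from functools import lru_cache

# Durations (minutes) and edges read off the job dependency diagram.
DURATION = {"config-validation": 2, "compile-check": 5, "build": 15,
            "test": 5, "symbol": 3, "quality": 2, "sanitize": 20}
NEEDS = {"config-validation": [], "compile-check": ["config-validation"],
         "build": ["compile-check"], "test": ["build"], "symbol": ["build"],
         "quality": ["build"], "sanitize": ["build"]}


@lru_cache(maxsize=None)
def finish_time(job):
    """Earliest minute at which `job` can finish, given unlimited runners."""
    start = max((finish_time(dep) for dep in NEEDS[job]), default=0)
    return start + DURATION[job]


total = max(finish_time(job) for job in DURATION)
print("pipeline total:", total, "min")
# -> pipeline total: 42 min
```

Under these assumptions the longest chain is config-validation → compile-check → build → sanitize, so the two gating stages add only a few minutes to a fully green run.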

---

## Implementation Plan

### Phase 1: Quick Wins (Week 1)

1. **Add config-validation job**
   - Copy composite actions
   - Add new job to `ci.yml`
   - Test on feature branch

2. **Add symbol-check script**
   - Already created: `scripts/verify-symbols.sh`
   - Add Windows version: `scripts/verify-symbols.ps1`
   - Test locally

3. **Update job dependencies**
   - Make `build` depend on `config-validation`
   - Enable fail-fast on `config-validation` (and on `compile-check` once it lands in Phase 2)

**Deliverables**:
- ✅ Config validation catches CMake errors in <2 min
- ✅ Symbol checker available for CI
- ✅ Fail-fast prevents wasted CI time

### Phase 2: Compilation Checks (Week 2)

1. **Add compile-check job**
   - Identify representative files
   - Create compilation target list
   - Add to CI workflow

2. **Platform-specific smoke tests**
   - Windows: Check `/std:c++latest`
   - Linux: Check `-std=c++20`
   - macOS: Check framework links

**Deliverables**:
- ✅ Compilation errors caught in <5 min
- ✅ Platform-specific issues detected early

### Phase 3: Symbol Validation (Week 3)

1. **Add symbol-check job**
   - Integrate `verify-symbols.sh`
   - Upload conflict reports
   - Add to required checks

2. **Create symbol conflict guide**
   - Document common issues
   - Provide fix examples
   - Link from CI failures

**Deliverables**:
- ✅ ODR violations caught before merge
- ✅ FLAGS conflicts detected automatically

### Phase 4: Optimization (Week 4)

1. **Fine-tune fail-fast**
   - Identify critical vs optional jobs
   - Set up conditional execution
   - Test resource savings

2. **Add caching improvements**
   - Cache compiled objects
   - Share artifacts between jobs
   - Optimize dependency downloads

**Deliverables**:
- ✅ ~60 minutes CI time saved per failed build
- ✅ Faster PR iteration

---

## Success Metrics

### Before Improvements

| Metric | Value |
|--------|-------|
| Time to first failure | 15-20 min |
| CI failures per month | ~30 |
| Wasted CI time/month | ~8 hours |
| PR iteration time | 2-4 hours |
| Symbol conflicts caught | 0% (manual) |

### After Improvements (Target)

| Metric | Value |
|--------|-------|
| Time to first failure | **2-5 min** |
| CI failures per month | **<10** |
| Wasted CI time/month | **<2 hours** |
| PR iteration time | **30-60 min** |
| Symbol conflicts caught | **100%** |

### ROI Calculation

**Time Savings**:
- 20 failures/month × 10 min saved = **200 minutes/month** (~3.3 hours)
- 10 failed PRs avoided = **~4 hours/month**
- **Total: ~7 hours/month saved**

**Developer Experience**:
- Faster feedback → less context switching
- Earlier error detection → easier debugging
- Fewer CI failures → less frustration

---

## Risks & Mitigations

### Risk 1: False Positives
**Risk**: New checks catch issues that aren't real problems

**Mitigation**:
- Test thoroughly before enabling as required
- Allow overrides for known false positives
- Iterate on filtering logic

### Risk 2: Increased Complexity
**Risk**: More jobs = harder to understand CI failures

**Mitigation**:
- Clear job names and descriptions
- Good error messages with links to docs
- Dependency graph visualization

### Risk 3: Slower PR Merges
**Risk**: More required checks = slower to merge

**Mitigation**:
- Make only critical checks required
- Run expensive checks post-merge
- Provide override mechanism for emergencies

---

## Alternative Approaches Considered

### Approach 1: Pre-commit Hooks
- **Pros**: Catch issues before pushing
- **Cons**: Developers can skip, not enforced
- **Decision**: Provide optional hooks, but rely on CI

### Approach 2: GitHub Actions Matrix Expansion
- **Pros**: Test more combinations
- **Cons**: Significantly more CI time
- **Decision**: Focus on critical paths, expand later if needed

### Approach 3: Self-Hosted Runners
- **Pros**: Faster builds, more control
- **Cons**: Maintenance overhead, security concerns
- **Decision**: Stick with GitHub runners for now

---

## Related Work

### Similar Implementations
- **LLVM Project**: Uses compile-only jobs for fast feedback
- **Chromium**: Extensive smoke testing before full builds
- **Abseil**: Symbol conflict detection in CI

### Best Practices
1. **Fail Fast**: Stop early if critical checks fail
2. **Layered Testing**: Quick checks first, expensive checks later
3. **Clear Feedback**: Good error messages with actionable advice
4. **Caching**: Reuse work across jobs when possible

---

## Appendix A: New CI Jobs (YAML)

### Config Validation Job
```yaml
config-validation:
  name: "Config Validation - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: true
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - name: "macOS 14"
          os: macos-14
          preset: ci-macos
          platform: macos
        - name: "Windows 2022"
          os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Validate CMake configuration
      run: cmake --preset ${{ matrix.preset }}

    - name: Check configuration
      shell: bash
      run: |
        # Check include paths
        grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt

        # Check that the presets file parses
        cmake --list-presets
```

### Compile Check Job
```yaml
compile-check:
  name: "Compile Check - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  needs: [config-validation]
  strategy:
    fail-fast: true
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - name: "macOS 14"
          os: macos-14
          preset: ci-macos
          platform: macos
        - name: "Windows 2022"
          os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Configure project
      run: cmake --preset ${{ matrix.preset }}

    - name: Smoke compilation test
      shell: bash
      run: ./scripts/pre-push-test.sh --smoke-only --preset ${{ matrix.preset }}
```

### Symbol Check Job
```yaml
symbol-check:
  name: "Symbol Check - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  needs: [build]
  strategy:
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          platform: linux
        - name: "macOS 14"
          os: macos-14
          platform: macos

  steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Download build artifacts
      uses: actions/download-artifact@v4
      with:
        name: build-${{ matrix.platform }}
        path: build

    - name: Check for symbol conflicts
      shell: bash
      run: ./scripts/verify-symbols.sh --build-dir build

    - name: Upload conflict report
      if: failure()
      uses: actions/upload-artifact@v4
      with:
        name: symbol-conflicts-${{ matrix.platform }}
        path: build/symbol-report.txt
```

---

## Appendix B: Cost Analysis

### Current Monthly CI Usage (Estimated)

| Job | Duration | Runs/Month | Total Time |
|-----|----------|------------|------------|
| build (3 platforms) | 15 min × 3 | 100 PRs | **75 hours** |
| test (3 platforms) | 5 min × 3 | 100 PRs | **25 hours** |
| windows-agent | 30 min | 30 | **15 hours** |
| code-quality | 2 min | 100 PRs | **3.3 hours** |
| memory-sanitizer | 20 min | 50 PRs | **16.7 hours** |
| z3ed-agent-test | 15 min | 30 | **7.5 hours** |
| **Total** | | | **142.5 hours** |

### Proposed Monthly CI Usage

| Job | Duration | Runs/Month | Total Time |
|-----|----------|------------|------------|
| config-validation (3) | 2 min × 3 | 100 PRs | **10 hours** |
| compile-check (3) | 5 min × 3 | 100 PRs | **25 hours** |
| build (3 platforms) | 15 min × 3 | 80 PRs | **60 hours** (↓20%) |
| test (3 platforms) | 5 min × 3 | 80 PRs | **20 hours** (↓20%) |
| symbol-check (2) | 3 min × 2 | 80 PRs | **8 hours** |
| windows-agent | 30 min | 25 | **12.5 hours** (↓17%) |
| code-quality | 2 min | 100 PRs | **3.3 hours** |
| memory-sanitizer | 20 min | 40 PRs | **13.3 hours** (↓20%) |
| z3ed-agent-test | 15 min | 25 | **6.25 hours** (↓17%) |
| **Total** | | | **158.4 hours** (+11%) |

**Net Change**: +16 hours/month (11% increase)

**BUT**:
- Fewer failed builds (20% reduction)
- Faster feedback (10-15 min saved per failure)
- Better developer experience (invaluable)

**Conclusion**: Slight increase in total CI time, but significant improvement in efficiency and developer experience
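
The totals in both tables can be reproduced from the per-job figures. The short Python check below copies the durations and run counts from the tables above and introduces no new data.

```python
# (job, minutes per run across platforms, runs per month), copied from the tables.
CURRENT = [
    ("build", 15 * 3, 100), ("test", 5 * 3, 100), ("windows-agent", 30, 30),
    ("code-quality", 2, 100), ("memory-sanitizer", 20, 50),
    ("z3ed-agent-test", 15, 30),
]
PROPOSED = [
    ("config-validation", 2 * 3, 100), ("compile-check", 5 * 3, 100),
    ("build", 15 * 3, 80), ("test", 5 * 3, 80), ("symbol-check", 3 * 2, 80),
    ("windows-agent", 30, 25), ("code-quality", 2, 100),
    ("memory-sanitizer", 20, 40), ("z3ed-agent-test", 15, 25),
]


def total_hours(rows):
    """Sum minutes-per-run × runs-per-month over all jobs, in hours."""
    return sum(minutes * runs for _job, minutes, runs in rows) / 60


current, proposed = total_hours(CURRENT), total_hours(PROPOSED)
print(f"current:  {current:.1f} h")
print(f"proposed: {proposed:.1f} h")
print(f"change:   {proposed - current:+.1f} h ({(proposed / current - 1) * 100:+.0f}%)")
# -> current:  142.5 h
# -> proposed: 158.4 h
# -> change:   +15.9 h (+11%)
```

Keeping the inputs in one place makes it easy to rerun the estimate if PR volume or job durations change.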