scawful 5c4cd57ff8 backend-infra-engineer: Post v0.3.9-hotfix7 snapshot (build cleanup)

2025-12-22 00:20:49 +00:00

CI/CD Improvements Proposal

Executive Summary

This document proposes specific improvements to the YAZE CI/CD pipeline to catch build failures earlier, reduce wasted CI time, and provide faster feedback to developers.

Goals:

Reduce time-to-first-failure from ~15 minutes to <5 minutes
Catch 90% of failures in fast jobs (<5 min)
Reduce PR iteration time from hours to minutes
Prevent platform-specific issues from reaching CI

ROI:

Time Saved: ~10 minutes per failed build × ~30 failures/month = 5 hours/month
Developer Experience: Faster feedback → less context switching
CI Cost: Roughly 11% more total CI time (see Appendix B), since the added jobs are short

Current CI Pipeline Analysis

Current Jobs

Job	Platform	Duration	Cost	Catches
build	Ubuntu/macOS/Windows	15-20 min	High	Compilation errors
test	Ubuntu/macOS/Windows	5 min	Medium	Test failures
windows-agent	Windows	30 min	High	AI stack issues
code-quality	Ubuntu	2 min	Low	Format/lint issues
memory-sanitizer	Ubuntu	20 min	High	Memory bugs
z3ed-agent-test	macOS	15 min	High	Agent integration

Total PR Time: ~40 minutes (parallel), ~90 minutes (worst case)

Issues with Current Pipeline

Long feedback loop: 15-20 minutes to find out if headers are missing
Wasted resources: Full 20-minute builds that fail in first 2 minutes
No early validation: CMake configuration succeeds, but compilation fails later
Symbol conflicts detected late: Link errors only appear after full compile
Platform-specific issues: Discovered after 15+ minutes per platform

Proposed Improvements

Improvement 1: Configuration Validation Job

Goal: Catch CMake errors in <2 minutes

Implementation:

config-validation:
  name: "Config Validation - ${{ matrix.preset }}"
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: true  # Stop immediately if any fails
    matrix:
      include:
        - os: ubuntu-22.04
          preset: ci-linux
          platform: linux  # read by the setup-build step below
        - os: macos-14
          preset: ci-macos
          platform: macos
        - os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Validate CMake configuration
      shell: bash  # Windows runners default to PowerShell, which rejects "\" continuations
      run: |
        cmake --preset ${{ matrix.preset }} \
          -DCMAKE_VERBOSE_MAKEFILE=OFF

    - name: Check include paths
      shell: bash
      run: |
        grep -q "INCLUDE_DIRECTORIES" build/CMakeCache.txt || \
          (echo "Include paths not configured" && exit 1)

    - name: Validate presets
      run: cmake --preset ${{ matrix.preset }} --list-presets

Benefits:

✅ Fails in <2 minutes for CMake errors
✅ Catches missing dependencies immediately
✅ Validates include path propagation
✅ Low resource usage (no compilation)

What it catches:

CMake syntax errors
Missing dependencies (immediate)
Invalid preset definitions
Include path misconfiguration
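
The "Check include paths" step relies on a plain grep. A slightly stricter variant can be sketched by parsing CMakeCache.txt into key/value pairs and reporting keys that are absent or empty. This is only an illustration: the helper names are made up here, and `YAZE_BUILD_TESTS` in the sample is a placeholder, not a confirmed cache variable of the project.

```python
def parse_cmake_cache(text):
    """Return {name: value} for each NAME:TYPE=VALUE entry in CMakeCache.txt."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks and the two comment styles CMake writes into the cache.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name = key.split(":", 1)[0]  # drop the :TYPE suffix
        entries[name] = value
    return entries


def missing_keys(text, required):
    """List required cache keys that are absent or have an empty value."""
    cache = parse_cmake_cache(text)
    return [key for key in required if not cache.get(key)]


# Sample cache content; YAZE_BUILD_TESTS is a hypothetical variable name.
sample = """\
# This is the CMakeCache file.
//Build type
CMAKE_BUILD_TYPE:STRING=RelWithDebInfo
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/c++
YAZE_BUILD_TESTS:BOOL=
"""

print(missing_keys(sample, ["CMAKE_BUILD_TYPE", "CMAKE_CXX_COMPILER", "YAZE_BUILD_TESTS"]))
```

Which keys count as required is project-specific. The point is only that a structured check fails with a named key rather than a generic grep miss.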

Improvement 2: Compile-Only Job

Goal: Catch compilation errors in <5 minutes

Implementation:

compile-check:
  name: "Compile Check - ${{ matrix.preset }}"
  runs-on: ${{ matrix.os }}
  needs: [config-validation]  # Run after config validation passes
  strategy:
    fail-fast: true  # Stop all platforms if one fails (matches Improvement 4 and Appendix A)
    matrix:
      include:
        - os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - os: macos-14
          preset: ci-macos
          platform: macos
        - os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Configure project
      run: cmake --preset ${{ matrix.preset }}

    - name: Compile representative files
      shell: bash  # needed on Windows for "\" continuations
      run: |
        # Compile 10-20 key files to catch most header issues.
        # Object-file target names are generator-specific; adjust them to
        # the targets the configured generator actually exposes.
        cmake --build build --target rom.cc.o bitmap.cc.o \
          overworld.cc.o resource_catalog.cc.o \
          dungeon.cc.o sprite.cc.o palette.cc.o \
          asar_wrapper.cc.o controller.cc.o canvas.cc.o \
          --parallel 4

    - name: Check for common issues
      shell: bash  # the [ ... ] test below is bash syntax
      run: |
        # Platform-specific checks
        if [ "${{ matrix.platform }}" = "windows" ]; then
          echo "Checking for /std:c++latest flag..."
          grep -q "std:c++latest" build/compile_commands.json || \
            echo "Warning: C++20 flag may be missing"
        fi

Benefits:

✅ Catches header issues in ~5 minutes
✅ Tests actual compilation without full build
✅ Platform-specific early detection
✅ ~70% faster than full build

What it catches:

Missing headers
Include path problems
Preprocessor errors
Template instantiation issues
Platform-specific compilation errors
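
The flag check in the last step can also be done per translation unit instead of with a single grep. The sketch below assumes a standard `compile_commands.json`, where each entry has a `file` plus either a `command` string or an `arguments` list. The function name and sample paths are illustrative only.

```python
import json


def files_missing_flag(compile_commands, flag):
    """Return source files whose compile command does not contain `flag`."""
    missing = []
    for entry in compile_commands:
        # Entries carry either an "arguments" list or a "command" string.
        if "arguments" in entry:
            args = entry["arguments"]
        else:
            # Naive whitespace split; enough for spotting a standalone flag.
            args = entry.get("command", "").split()
        if flag not in args:
            missing.append(entry["file"])
    return missing


# Hypothetical database with one file compiled without the standard flag.
sample = json.loads("""
[
  {"directory": "/b", "file": "src/rom.cc",
   "command": "c++ -std=c++20 -Isrc -c src/rom.cc -o rom.cc.o"},
  {"directory": "/b", "file": "src/bitmap.cc",
   "arguments": ["c++", "-Isrc", "-c", "src/bitmap.cc"]}
]
""")

print(files_missing_flag(sample, "-std=c++20"))
```

On Windows the same function could be called with `/std:c++latest`. Listing the affected files tends to make the warning easier to act on than a single pass/fail line.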

Improvement 3: Symbol Conflict Job

Goal: Detect ODR violations and duplicate symbols before merge

Implementation:

symbol-check:
  name: "Symbol Check - ${{ matrix.platform }}"
  runs-on: ${{ matrix.os }}
  needs: [build]  # Run after full build completes
  strategy:
    matrix:
      include:
        - os: ubuntu-22.04
          platform: linux
        - os: macos-14
          platform: macos
        - os: windows-2022
          platform: windows

  steps:
    - uses: actions/checkout@v4

    - name: Download build artifacts
      uses: actions/download-artifact@v4
      with:
        name: build-${{ matrix.platform }}
        path: build

    - name: Check for symbol conflicts (Unix)
      if: matrix.platform != 'windows'
      run: ./scripts/verify-symbols.sh --build-dir build

    - name: Check for symbol conflicts (Windows)
      if: matrix.platform == 'windows'
      shell: pwsh
      run: .\scripts\verify-symbols.ps1 -BuildDir build

    - name: Upload conflict report
      if: failure()
      uses: actions/upload-artifact@v4
      with:
        name: symbol-conflicts-${{ matrix.platform }}
        path: build/symbol-report.txt

Benefits:

✅ Catches ODR violations before merge
✅ Detects FLAGS conflicts (Linux-specific)
✅ Platform-specific symbol issues
✅ Runs in parallel with tests (~3 minutes)

What it catches:

Duplicate symbol definitions
FLAGS_* conflicts (gflags)
ODR violations
Link-time errors (predicted)
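
The contents of scripts/verify-symbols.sh are not shown here, so the following is only a sketch of the general technique, not a description of that script. Given `nm` output for each object file, it collects strongly defined symbols and reports any defined in more than one object. It catches duplicate definitions only, not every kind of ODR violation, and `FLAGS_rom_file` is a made-up symbol name.

```python
from collections import defaultdict

# nm type letters for strong, externally visible definitions
# (text, data, BSS, read-only data). Weak symbols (W, V) are skipped
# because several weak definitions may legitimately coexist.
STRONG_DEFINITIONS = {"T", "D", "B", "R"}


def defined_symbols(nm_output):
    """Extract strongly defined symbol names from `nm` output for one object."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as: <address> <type> <name>
        if len(parts) == 3 and parts[1] in STRONG_DEFINITIONS:
            symbols.add(parts[2])
    return symbols


def find_conflicts(nm_by_object):
    """Map each symbol defined in more than one object to those objects."""
    owners = defaultdict(list)
    for obj, output in sorted(nm_by_object.items()):
        for symbol in defined_symbols(output):
            owners[symbol].append(obj)
    return {sym: objs for sym, objs in owners.items() if len(objs) > 1}


# Hypothetical nm output for two object files.
sample = {
    "a.o": "0000000000000000 T _Z4loadv\n"
           "0000000000000000 D FLAGS_rom_file\n"
           "                 U printf\n",
    "b.o": "0000000000000000 T _Z4savev\n"
           "0000000000000010 D FLAGS_rom_file\n"
           "0000000000000000 W _ZN3FooC1Ev\n",
}

print(find_conflicts(sample))
```

A report in this shape (symbol, then the objects that define it) is what would be worth uploading as the conflict artifact.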

Improvement 4: Fail-Fast Strategy

Goal: Stop wasting resources on doomed builds

Current Behavior: All jobs run even if one fails

Proposed Behavior: Stop non-essential jobs if critical jobs fail

Implementation:

jobs:
  # Critical path: These must pass
  config-validation:
    # ... (as above)

  compile-check:
    needs: [config-validation]
    strategy:
      fail-fast: true  # Stop all platforms if one fails

  build:
    needs: [compile-check]
    strategy:
      fail-fast: false  # Allow other platforms to continue

  # Non-critical: These can be skipped if builds fail
  integration-tests:
    needs: [build]
    if: success()  # Only run if build succeeded

  windows-agent:
    needs: [build, test]
    if: success() && github.event_name != 'pull_request'

Benefits:

✅ Saves ~60 minutes of CI time per failed build
✅ Faster feedback (no waiting for doomed jobs)
✅ Reduced resource usage
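
To see which jobs a failure cuts off, the `needs` graph can be modelled directly. The sketch below assumes GitHub Actions' default behaviour, where a job is skipped when any job it needs fails. The dictionary is a hand-written copy of the dependencies above, with `test` assumed to need `build` as in the dependency diagram.

```python
def skipped_jobs(needs, failed):
    """Return jobs that never start because they depend, directly or
    transitively, on the failed job."""
    skipped = set()
    changed = True
    while changed:
        changed = False
        for job, deps in needs.items():
            if job in skipped or job == failed:
                continue
            if any(dep == failed or dep in skipped for dep in deps):
                skipped.add(job)
                changed = True
    return sorted(skipped)


# Hand-written copy of the proposed dependencies.
needs = {
    "config-validation": [],
    "compile-check": ["config-validation"],
    "build": ["compile-check"],
    "test": ["build"],
    "integration-tests": ["build"],
    "windows-agent": ["build", "test"],
}

print(skipped_jobs(needs, "compile-check"))
```

With `compile-check` failing, everything from `build` onward is skipped, which is where the claimed savings come from.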

Improvement 5: Preset Matrix Testing

Goal: Validate all presets can configure

Implementation:

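preset-validation:
  name: "Preset Validation - ${{ matrix.platform }}"
  runs-on: ${{ matrix.os }}
  strategy:
    matrix:
      include:
        - os: ubuntu-22.04
          platform: linux
        - os: macos-14
          platform: macos
        - os: windows-2022
          platform: windows

  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Test all presets for platform
      shell: bash
      run: |
        # Preset names carry the platform (e.g. ci-linux), not the runner
        # label, so filter on matrix.platform. cmake prints each name in
        # double quotes, which tr strips.
        for preset in $(cmake --list-presets | grep "${{ matrix.platform }}" | awk '{print $1}' | tr -d '"'); do
          echo "Testing preset: $preset"
          cmake --preset "$preset" || exit 1
        done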

Benefits:

✅ Catches invalid preset definitions
✅ Validates CMake configuration across all presets
✅ Fast (<2 minutes)
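
Instead of scraping `cmake --list-presets` output, the preset names can be read straight from CMakePresets.json. The sketch below relies only on the documented `configurePresets`, `name` and `hidden` fields. Filtering by a platform substring is an assumption based on the ci-linux, ci-macos and ci-windows naming used in this proposal.

```python
import json


def configure_presets(presets_json, platform):
    """Names of non-hidden configure presets whose name mentions `platform`."""
    data = json.loads(presets_json)
    return [
        preset["name"]
        for preset in data.get("configurePresets", [])
        if not preset.get("hidden", False) and platform in preset["name"]
    ]


# Hypothetical presets file; only the ci-* names come from this proposal.
sample = """
{
  "version": 6,
  "configurePresets": [
    {"name": "base", "hidden": true, "binaryDir": "${sourceDir}/build"},
    {"name": "ci-linux", "inherits": "base"},
    {"name": "ci-macos", "inherits": "base"},
    {"name": "ci-windows", "inherits": "base"}
  ]
}
"""

print(configure_presets(sample, "linux"))
```

Each returned name could then be passed to `cmake --preset`. Hidden base presets are left out because they cannot be configured directly.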

Proposed CI Pipeline (New)

Job Dependencies

┌─────────────────────┐
│ config-validation   │ (2 min, fail-fast)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  compile-check      │ (5 min, fail-fast)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│       build         │ (15 min, parallel)
└──────────┬──────────┘
           │
           ├──────────┬──────────┬──────────┐
           ▼          ▼          ▼          ▼
      ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
      │  test  │ │ symbol │ │quality │ │sanitize│
      │ (5 min)│ │(3 min) │ │(2 min) │ │(20 min)│
      └────────┘ └────────┘ └────────┘ └────────┘

Time Comparison

Current Pipeline:

First failure: ~15 minutes (compilation error)
Total time: ~40 minutes (if all succeed)

Proposed Pipeline:

First failure: ~2 minutes (CMake error) or ~5 minutes (compilation error)
Total time: ~40 minutes (if all succeed)

Time Saved:

CMake errors: 13 minutes saved (15 min → 2 min)
Compilation errors: 10 minutes saved (15 min → 5 min)
Symbol conflicts: Caught earlier (no failed PRs)
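
The ~40 minute total for the proposed pipeline can be sanity-checked by walking the dependency graph with the durations from the diagram. This is a rough model: it ignores queueing and setup overhead, and the dictionary names are illustrative rather than part of the workflow.

```python
from functools import lru_cache

# Durations in minutes, taken from the dependency diagram.
DURATION_MIN = {
    "config-validation": 2,
    "compile-check": 5,
    "build": 15,
    "test": 5,
    "symbol-check": 3,
    "code-quality": 2,
    "memory-sanitizer": 20,
}

# Each job mapped to the jobs it waits for.
NEEDS = {
    "config-validation": (),
    "compile-check": ("config-validation",),
    "build": ("compile-check",),
    "test": ("build",),
    "symbol-check": ("build",),
    "code-quality": ("build",),
    "memory-sanitizer": ("build",),
}


@lru_cache(maxsize=None)
def finish_minute(job):
    """Earliest minute at which `job` can finish, given its dependencies."""
    start = max((finish_minute(dep) for dep in NEEDS[job]), default=0)
    return start + DURATION_MIN[job]


# The slowest chain determines the wall-clock time of a fully green run.
pipeline_total = max(finish_minute(job) for job in NEEDS)
print(pipeline_total)
```

The longest chain is config-validation, compile-check, build, then the sanitizer, which lands close to the ~40 minutes quoted above.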

Implementation Plan

Phase 1: Quick Wins (Week 1)

Add config-validation job
- Copy composite actions
- Add new job to ci.yml
- Test on feature branch
Add symbol-check script
- Already created: scripts/verify-symbols.sh
- Add Windows version: scripts/verify-symbols.ps1
- Test locally
Update job dependencies
- Make build depend on config-validation
- Enable fail-fast on config-validation (compile-check arrives in Phase 2)

Deliverables:

✅ Config validation catches CMake errors in <2 min
✅ Symbol checker available for CI
✅ Fail-fast prevents wasted CI time

Phase 2: Compilation Checks (Week 2)

Add compile-check job
- Identify representative files
- Create compilation target list
- Add to CI workflow
Platform-specific smoke tests
- Windows: Check /std:c++latest
- Linux: Check -std=c++20
- macOS: Check framework links

Deliverables:

✅ Compilation errors caught in <5 min
✅ Platform-specific issues detected early

Phase 3: Symbol Validation (Week 3)

Add symbol-check job
- Integrate verify-symbols.sh
- Upload conflict reports
- Add to required checks
Create symbol conflict guide
- Document common issues
- Provide fix examples
- Link from CI failures

Deliverables:

✅ ODR violations caught before merge
✅ FLAGS conflicts detected automatically

Phase 4: Optimization (Week 4)

Fine-tune fail-fast
- Identify critical vs optional jobs
- Set up conditional execution
- Test resource savings
Add caching improvements
- Cache compiled objects
- Share artifacts between jobs
- Optimize dependency downloads

Deliverables:

✅ ~60 minutes CI time saved per failed build
✅ Faster PR iteration

Success Metrics

Before Improvements

Metric	Value
Time to first failure	15-20 min
CI failures per month	~30
Wasted CI time/month	~8 hours
PR iteration time	2-4 hours
Symbol conflicts caught	0% (manual)

After Improvements (Target)

Metric	Value
Time to first failure	2-5 min
CI failures per month	<10
Wasted CI time/month	<2 hours
PR iteration time	30-60 min
Symbol conflicts caught	100%

ROI Calculation

Time Savings:

20 failures/month × 10 min saved = 200 minutes/month (~3.3 hours)
10 failed PRs avoided = ~4 hours/month
Total: ~7 hours/month saved

Developer Experience:

Faster feedback → less context switching
Earlier error detection → easier debugging
Fewer CI failures → less frustration

Risks & Mitigations

Risk 1: False Positives

Risk: New checks catch issues that aren't real problems

Mitigation:

Test thoroughly before enabling as required
Allow overrides for known false positives
Iterate on filtering logic

Risk 2: Increased Complexity

Risk: More jobs = harder to understand CI failures

Mitigation:

Clear job names and descriptions
Good error messages with links to docs
Dependency graph visualization

Risk 3: Slower PR Merges

Risk: More required checks = slower to merge

Mitigation:

Make only critical checks required
Run expensive checks post-merge
Provide override mechanism for emergencies

Alternative Approaches Considered

Approach 1: Pre-commit Hooks

Pros: Catch issues before pushing
Cons: Developers can skip, not enforced
Decision: Provide optional hooks, but rely on CI

Approach 2: GitHub Actions Matrix Expansion

Pros: Test more combinations
Cons: Significantly more CI time
Decision: Focus on critical paths, expand later if needed

Approach 3: Self-Hosted Runners

Pros: Faster builds, more control
Cons: Maintenance overhead, security concerns
Decision: Stick with GitHub runners for now

Related Work

Similar Implementations

LLVM Project: Uses compile-only jobs for fast feedback
Chromium: Extensive smoke testing before full builds
Abseil: Symbol conflict detection in CI

Best Practices

Fail Fast: Stop early if critical checks fail
Layered Testing: Quick checks first, expensive checks later
Clear Feedback: Good error messages with actionable advice
Caching: Reuse work across jobs when possible

Appendix A: New CI Jobs (YAML)

Config Validation Job

config-validation:
  name: "Config Validation - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: true
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - name: "macOS 14"
          os: macos-14
          preset: ci-macos
          platform: macos
        - name: "Windows 2022"
          os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Validate CMake configuration
      run: cmake --preset ${{ matrix.preset }}

    - name: Check configuration
      shell: bash
      run: |
        # Check include paths
        grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt

        # Check preset is valid
        cmake --preset ${{ matrix.preset }} --list-presets

Compile Check Job

compile-check:
  name: "Compile Check - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  needs: [config-validation]
  strategy:
    fail-fast: true
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - name: "macOS 14"
          os: macos-14
          preset: ci-macos
          platform: macos
        - name: "Windows 2022"
          os: windows-2022
          preset: ci-windows
          platform: windows

  steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        submodules: recursive

    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}

    - name: Configure project
      run: cmake --preset ${{ matrix.preset }}

    - name: Smoke compilation test
      shell: bash
      run: ./scripts/pre-push-test.sh --smoke-only --preset ${{ matrix.preset }}

Symbol Check Job

symbol-check:
  name: "Symbol Check - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  needs: [build]
  strategy:
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          platform: linux
        - name: "macOS 14"
          os: macos-14
          platform: macos

  steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Download build artifacts
      uses: actions/download-artifact@v4
      with:
        name: build-${{ matrix.platform }}
        path: build

    - name: Check for symbol conflicts
      shell: bash
      run: ./scripts/verify-symbols.sh --build-dir build

    - name: Upload conflict report
      if: failure()
      uses: actions/upload-artifact@v4
      with:
        name: symbol-conflicts-${{ matrix.platform }}
        path: build/symbol-report.txt

Appendix B: Cost Analysis

Current Monthly CI Usage (Estimated)

Job	Duration	Runs/Month	Total Time
build (3 platforms)	15 min × 3	100 PRs	75 hours
test (3 platforms)	5 min × 3	100 PRs	25 hours
windows-agent	30 min	30	15 hours
code-quality	2 min	100 PRs	3.3 hours
memory-sanitizer	20 min	50 PRs	16.7 hours
z3ed-agent-test	15 min	30	7.5 hours
Total			142.5 hours

Proposed Monthly CI Usage

Job	Duration	Runs/Month	Total Time
config-validation (3)	2 min × 3	100 PRs	10 hours
compile-check (3)	5 min × 3	100 PRs	25 hours
build (3 platforms)	15 min × 3	80 PRs	60 hours (↓20%)
test (3 platforms)	5 min × 3	80 PRs	20 hours (↓20%)
symbol-check (2)	3 min × 2	80 PRs	8 hours
windows-agent	30 min	25	12.5 hours (↓17%)
code-quality	2 min	100 PRs	3.3 hours
memory-sanitizer	20 min	40 PRs	13.3 hours (↓20%)
z3ed-agent-test	15 min	25	6.25 hours (↓17%)
Total			158.4 hours (+11%)

Net Change: +16 hours/month (11% increase)
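
The totals in both tables can be reproduced from the per-job rows. The short sketch below uses the durations and run counts exactly as listed. The tuple layout and names are just one convenient encoding.

```python
# (job, minutes per run across platforms, runs per month)
CURRENT = [
    ("build", 15 * 3, 100),
    ("test", 5 * 3, 100),
    ("windows-agent", 30, 30),
    ("code-quality", 2, 100),
    ("memory-sanitizer", 20, 50),
    ("z3ed-agent-test", 15, 30),
]

PROPOSED = [
    ("config-validation", 2 * 3, 100),
    ("compile-check", 5 * 3, 100),
    ("build", 15 * 3, 80),
    ("test", 5 * 3, 80),
    ("symbol-check", 3 * 2, 80),
    ("windows-agent", 30, 25),
    ("code-quality", 2, 100),
    ("memory-sanitizer", 20, 40),
    ("z3ed-agent-test", 15, 25),
]


def total_hours(rows):
    """Sum minutes-per-run times runs-per-month and convert to hours."""
    return sum(minutes * runs for _, minutes, runs in rows) / 60


current = total_hours(CURRENT)
proposed = total_hours(PROPOSED)
print(f"current:  {current:.1f} h")
print(f"proposed: {proposed:.1f} h")
print(f"change:   {proposed - current:+.1f} h ({proposed / current - 1:+.0%})")
```

Keeping the rows in one place like this makes it easy to rerun the estimate once real run counts are available.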

BUT:

Fewer full builds (about 20% fewer build and test runs, because failing PRs stop at the fast checks)
Faster feedback (10-13 min saved per failure)
Better developer experience (less waiting and context switching)

Conclusion: Total CI time rises by roughly 11%, but failures surface sooner and fewer PRs reach the expensive jobs.
