CI/CD Improvements Proposal
Executive Summary
This document proposes specific improvements to the YAZE CI/CD pipeline to catch build failures earlier, reduce wasted CI time, and provide faster feedback to developers.
Goals:
- Reduce time-to-first-failure from ~15 minutes to <5 minutes
- Catch 90% of failures in fast jobs (<5 min)
- Reduce PR iteration time from hours to minutes
- Prevent platform-specific issues from reaching CI
ROI:
- Time Saved: ~10 minutes per failed build × ~30 failures/month = 5 hours/month
- Developer Experience: Faster feedback → less context switching
- CI Cost: Minimal (fast jobs use fewer resources)
Current CI Pipeline Analysis
Current Jobs
| Job | Platform | Duration | Cost | Catches |
|---|---|---|---|---|
| build | Ubuntu/macOS/Windows | 15-20 min | High | Compilation errors |
| test | Ubuntu/macOS/Windows | 5 min | Medium | Test failures |
| windows-agent | Windows | 30 min | High | AI stack issues |
| code-quality | Ubuntu | 2 min | Low | Format/lint issues |
| memory-sanitizer | Ubuntu | 20 min | High | Memory bugs |
| z3ed-agent-test | macOS | 15 min | High | Agent integration |
Total PR Time: ~40 minutes (parallel), ~90 minutes (worst case)
Issues with Current Pipeline
- Long feedback loop: 15-20 minutes to find out if headers are missing
- Wasted resources: Full 20-minute builds that fail in first 2 minutes
- No early validation: CMake configuration succeeds, but compilation fails later
- Symbol conflicts detected late: Link errors only appear after full compile
- Platform-specific issues: Discovered after 15+ minutes per platform
Proposed Improvements
Improvement 1: Configuration Validation Job
Goal: Catch CMake errors in <2 minutes
Implementation:
config-validation:
  name: "Config Validation - ${{ matrix.preset }}"
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: true # Stop immediately if any fails
    matrix:
      include:
        - os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - os: macos-14
          preset: ci-macos
          platform: macos
        - os: windows-2022
          preset: ci-windows
          platform: windows
  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive
    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }} # Defined per matrix entry above
        preset: ${{ matrix.preset }}
    - name: Validate CMake configuration
      shell: bash # Backslash continuations need bash, including on Windows
      run: |
        cmake --preset ${{ matrix.preset }} \
          -DCMAKE_VERBOSE_MAKEFILE=OFF
    - name: Check include paths
      shell: bash
      run: |
        grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt || \
          (echo "Include paths not configured" && exit 1)
    - name: Validate presets
      run: cmake --preset ${{ matrix.preset }} --list-presets
Benefits:
- ✅ Fails in <2 minutes for CMake errors
- ✅ Catches missing dependencies immediately
- ✅ Validates include path propagation
- ✅ Low resource usage (no compilation)
What it catches:
- CMake syntax errors
- Missing dependencies (immediate)
- Invalid preset definitions
- Include path misconfiguration
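The include-path check can also be run locally before pushing. The sketch below is a hypothetical helper, not part of the existing scripts: it assumes dependencies record their header locations in CMakeCache.txt as `<PKG>_INCLUDE_DIR` or `<PKG>_INCLUDE_DIRS` entries, which is common for find-modules but not guaranteed for every dependency.

```shell
# Hypothetical helper: report how many include-directory entries a CMake cache holds.
# Assumes entries of the form <PKG>_INCLUDE_DIR:TYPE=value or <PKG>_INCLUDE_DIRS:TYPE=value.
check_include_entries() {
  cache_file="$1"
  if [ ! -f "$cache_file" ]; then
    echo "missing: $cache_file"
    return 2
  fi
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  count=$(grep -E -c '^[A-Za-z0-9_]+_INCLUDE_DIRS?:[A-Z]+=' "$cache_file" || true)
  if [ "$count" -eq 0 ]; then
    echo "no include-directory entries in $cache_file"
    return 1
  fi
  echo "found $count include-directory entries"
}

# Usage (after configuring): check_include_entries build/CMakeCache.txt
```

A non-zero return code lets the helper double as a CI step or a pre-push check.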
Improvement 2: Compile-Only Job
Goal: Catch compilation errors in <5 minutes
Implementation:
compile-check:
  name: "Compile Check - ${{ matrix.preset }}"
  runs-on: ${{ matrix.os }}
  needs: [config-validation] # Run after config validation passes
  strategy:
    fail-fast: true # Stop all platforms if one fails
    matrix:
      include:
        - os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - os: macos-14
          preset: ci-macos
          platform: macos
        - os: windows-2022
          preset: ci-windows
          platform: windows
  steps:
    - uses: actions/checkout@v4
      with:
        submodules: recursive
    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}
    - name: Configure project
      run: cmake --preset ${{ matrix.preset }}
    - name: Compile representative files
      shell: bash # Backslash continuations need bash, including on Windows
      run: |
        # Compile 10-20 key files to catch most header issues
        cmake --build build --target rom.cc.o bitmap.cc.o \
          overworld.cc.o resource_catalog.cc.o \
          dungeon.cc.o sprite.cc.o palette.cc.o \
          asar_wrapper.cc.o controller.cc.o canvas.cc.o \
          --parallel 4
    - name: Check for common issues
      shell: bash # The [ ... ] test below is bash syntax
      run: |
        # Platform-specific checks
        if [ "${{ matrix.platform }}" = "windows" ]; then
          echo "Checking for /std:c++latest flag..."
          grep "std:c++latest" build/compile_commands.json || \
            echo "Warning: /std:c++latest flag may be missing"
        fi
Benefits:
- ✅ Catches header issues in ~5 minutes
- ✅ Tests actual compilation without full build
- ✅ Platform-specific early detection
- ✅ ~70% faster than full build
What it catches:
- Missing headers
- Include path problems
- Preprocessor errors
- Template instantiation issues
- Platform-specific compilation errors
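The platform-specific flag check can be generalized beyond Windows. The following is a minimal sketch under stated assumptions: `has_std_flag` is a hypothetical name, the build exports `compile_commands.json` (CMake writes it only for Makefile and Ninja generators when `CMAKE_EXPORT_COMPILE_COMMANDS` is on), and the accepted flag spellings are illustrative rather than exhaustive.

```shell
# Hypothetical helper: succeed if the compilation database mentions a C++20-level flag.
# $1 = path to compile_commands.json, $2 = platform (windows, linux, macos).
has_std_flag() {
  db="$1"
  platform="$2"
  case "$platform" in
    windows)
      # Matched without the leading character, because MSVC flags may be
      # written as /std:... or -std:... in the database.
      grep -q -F -e 'std:c++latest' -e 'std:c++20' "$db"
      ;;
    *)
      # GCC and Clang spellings, with or without GNU extensions.
      grep -q -F -e '-std=c++20' -e '-std=gnu++20' "$db"
      ;;
  esac
}

# Usage: has_std_flag build/compile_commands.json linux || echo "Warning: C++20 flag may be missing"
```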
Improvement 3: Symbol Conflict Job
Goal: Detect ODR violations and duplicate symbols before merge, using the object files produced by the build job
Implementation:
symbol-check:
  name: "Symbol Check - ${{ matrix.platform }}"
  runs-on: ${{ matrix.os }}
  needs: [build] # Run after full build completes
  strategy:
    matrix:
      include:
        - os: ubuntu-22.04
          platform: linux
        - os: macos-14
          platform: macos
        - os: windows-2022
          platform: windows
  steps:
    - uses: actions/checkout@v4
    - name: Download build artifacts
      uses: actions/download-artifact@v4
      with:
        name: build-${{ matrix.platform }}
        path: build
    - name: Check for symbol conflicts (Unix)
      if: matrix.platform != 'windows'
      run: ./scripts/verify-symbols.sh --build-dir build
    - name: Check for symbol conflicts (Windows)
      if: matrix.platform == 'windows'
      shell: pwsh
      run: .\scripts\verify-symbols.ps1 -BuildDir build
    - name: Upload conflict report
      if: failure()
      uses: actions/upload-artifact@v4
      with:
        name: symbol-conflicts-${{ matrix.platform }}
        path: build/symbol-report.txt
Benefits:
- ✅ Catches ODR violations before merge
- ✅ Detects FLAGS conflicts (Linux-specific)
- ✅ Platform-specific symbol issues
- ✅ Runs in parallel with tests (~3 minutes)
What it catches:
- Duplicate symbol definitions
- FLAGS_* conflicts (gflags)
- ODR violations
- Link-time errors (predicted)
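To illustrate the kind of check involved, here is a minimal sketch of duplicate-symbol detection. It is not the contents of scripts/verify-symbols.sh; `find_duplicate_symbols` is a hypothetical function that assumes its input is mangled `nm -A` output (lines of the form `file:address TYPE symbol`) and treats only strong definitions in the text, data, and BSS sections (`T`, `D`, `B`) as conflicts. Weak symbols such as inline functions are ignored.

```shell
# Hypothetical helper: read `nm -A` style lines on stdin and print every symbol
# that has a strong definition in more than one object file.
find_duplicate_symbols() {
  awk '
    NF == 3 && $2 ~ /^[TDB]$/ {
      file = $1
      sub(/:[0-9A-Fa-f]+$/, "", file)      # drop the trailing ":address"
      if (!seen[$3 SUBSEP file]++) {       # count each (symbol, file) pair once
        count[$3]++
        files[$3] = files[$3] " " file
      }
    }
    END {
      for (sym in count)
        if (count[sym] > 1) print sym ":" files[sym]
    }
  ' | sort
}

# Usage (assumed object layout): nm -A build/**/*.o 2>/dev/null | find_duplicate_symbols
```

Each output line names one conflicting symbol followed by the object files that define it, which is the information needed to resolve a FLAGS_* clash.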
Improvement 4: Fail-Fast Strategy
Goal: Stop wasting resources on doomed builds
Current Behavior: All jobs run even if one fails
Proposed Behavior: Stop non-essential jobs if critical jobs fail
Implementation:
jobs:
  # Critical path: These must pass
  config-validation:
    # ... (as above)
  compile-check:
    needs: [config-validation]
    strategy:
      fail-fast: true # Stop all platforms if one fails
  build:
    needs: [compile-check]
    strategy:
      fail-fast: false # Allow other platforms to continue
  # Non-critical: These can be skipped if builds fail
  integration-tests:
    needs: [build]
    if: success() # Only run if build succeeded
  windows-agent:
    needs: [build, test]
    if: success() && github.event_name != 'pull_request'
Benefits:
- ✅ Saves ~60 minutes of CI time per failed build
- ✅ Faster feedback (no waiting for doomed jobs)
- ✅ Reduced resource usage
Improvement 5: Preset Matrix Testing
Goal: Validate all presets can configure
Implementation:
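preset-validation:
  name: "Preset Validation"
  runs-on: ${{ matrix.os }}
  strategy:
    matrix:
      include:
        - os: ubuntu-22.04
          platform: linux
        - os: macos-14
          platform: macos
        - os: windows-2022
          platform: windows
  steps:
    - uses: actions/checkout@v4
    - name: Test all presets for platform
      shell: bash
      run: |
        # Preset names (e.g. ci-linux) contain the platform, not the runner label,
        # and `cmake --list-presets` prints them in double quotes.
        for preset in $(cmake --list-presets | grep "${{ matrix.platform }}" | awk '{print $1}' | tr -d '"'); do
          echo "Testing preset: $preset"
          cmake --preset "$preset" || exit 1
        done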
Benefits:
- ✅ Catches invalid preset definitions
- ✅ Validates CMake configuration across all presets
- ✅ Fast (<2 minutes)
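Selecting presets per platform depends on parsing the `cmake --list-presets` output, which prints each preset name in double quotes, optionally followed by a display name. A minimal sketch of that parsing step follows; `preset_names` and `presets_for_platform` are hypothetical helpers, and the filter assumes preset names contain the platform string (as in ci-linux, ci-macos, ci-windows).

```shell
# Hypothetical helper: extract the quoted preset names from `cmake --list-presets` output.
preset_names() {
  sed -n 's/^[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: keep only the presets whose name contains the given platform.
presets_for_platform() {
  preset_names | grep -F -- "$1"
}

# Usage: cmake --list-presets | presets_for_platform linux
```

Header lines such as "Available configure presets:" contain no quoted name and are dropped by the `sed` expression.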
Proposed CI Pipeline (New)
Job Dependencies
┌─────────────────────┐
│  config-validation  │  (2 min, fail-fast)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    compile-check    │  (5 min, fail-fast)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│        build        │  (15 min, parallel)
└──────────┬──────────┘
           │
           ├──────────┬──────────┬──────────┐
           ▼          ▼          ▼          ▼
      ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
      │  test  │ │ symbol │ │quality │ │sanitize│
      │ (5 min)│ │(3 min) │ │(2 min) │ │(20 min)│
      └────────┘ └────────┘ └────────┘ └────────┘
Time Comparison
Current Pipeline:
- First failure: ~15 minutes (compilation error)
- Total time: ~40 minutes (if all succeed)
Proposed Pipeline:
- First failure: ~2 minutes (CMake error) or ~5 minutes (compilation error)
- Total time: ~40 minutes (if all succeed)
Time Saved:
- CMake errors: 13 minutes saved (15 min → 2 min)
- Compilation errors: 10 minutes saved (15 min → 5 min)
- Symbol conflicts: Caught earlier (no failed PRs)
Implementation Plan
Phase 1: Quick Wins (Week 1)
- Add config-validation job
  - Copy composite actions
  - Add new job to ci.yml
  - Test on feature branch
- Add symbol-check script
  - Already created: scripts/verify-symbols.sh
  - Add Windows version: scripts/verify-symbols.ps1
  - Test locally
- Update job dependencies
  - Make build depend on config-validation
  - Add fail-fast to compile-check
Deliverables:
- ✅ Config validation catches CMake errors in <2 min
- ✅ Symbol checker available for CI
- ✅ Fail-fast prevents wasted CI time
Phase 2: Compilation Checks (Week 2)
- Add compile-check job
  - Identify representative files
  - Create compilation target list
  - Add to CI workflow
- Platform-specific smoke tests
  - Windows: Check /std:c++latest
  - Linux: Check -std=c++20
  - macOS: Check framework links
Deliverables:
- ✅ Compilation errors caught in <5 min
- ✅ Platform-specific issues detected early
Phase 3: Symbol Validation (Week 3)
- Add symbol-check job
  - Integrate verify-symbols.sh
  - Upload conflict reports
  - Add to required checks
- Create symbol conflict guide
  - Document common issues
  - Provide fix examples
  - Link from CI failures
Deliverables:
- ✅ ODR violations caught before merge
- ✅ FLAGS conflicts detected automatically
Phase 4: Optimization (Week 4)
- Fine-tune fail-fast
  - Identify critical vs optional jobs
  - Set up conditional execution
  - Test resource savings
- Add caching improvements
  - Cache compiled objects
  - Share artifacts between jobs
  - Optimize dependency downloads
Deliverables:
- ✅ ~60 minutes CI time saved per failed build
- ✅ Faster PR iteration
Success Metrics
Before Improvements
| Metric | Value |
|---|---|
| Time to first failure | 15-20 min |
| CI failures per month | ~30 |
| Wasted CI time/month | ~8 hours |
| PR iteration time | 2-4 hours |
| Symbol conflicts caught | 0% (manual) |
After Improvements (Target)
| Metric | Value |
|---|---|
| Time to first failure | 2-5 min |
| CI failures per month | <10 |
| Wasted CI time/month | <2 hours |
| PR iteration time | 30-60 min |
| Symbol conflicts caught | 100% |
ROI Calculation
Time Savings:
- 20 failures/month × 10 min saved = 200 minutes/month (~3.3 hours)
- 10 failed PRs avoided = ~4 hours/month
- Total: ~7 hours/month saved
Developer Experience:
- Faster feedback → less context switching
- Earlier error detection → easier debugging
- Fewer CI failures → less frustration
Risks & Mitigations
Risk 1: False Positives
Risk: New checks catch issues that aren't real problems
Mitigation:
- Test thoroughly before enabling as required
- Allow overrides for known false positives
- Iterate on filtering logic
Risk 2: Increased Complexity
Risk: More jobs = harder to understand CI failures
Mitigation:
- Clear job names and descriptions
- Good error messages with links to docs
- Dependency graph visualization
Risk 3: Slower PR Merges
Risk: More required checks = slower to merge
Mitigation:
- Make only critical checks required
- Run expensive checks post-merge
- Provide override mechanism for emergencies
Alternative Approaches Considered
Approach 1: Pre-commit Hooks
Pros: Catch issues before pushing
Cons: Developers can skip, not enforced
Decision: Provide optional hooks, but rely on CI
Approach 2: GitHub Actions Matrix Expansion
Pros: Test more combinations
Cons: Significantly more CI time
Decision: Focus on critical paths, expand later if needed
Approach 3: Self-Hosted Runners
Pros: Faster builds, more control
Cons: Maintenance overhead, security concerns
Decision: Stick with GitHub runners for now
Related Work
Similar Implementations
- LLVM Project: Uses compile-only jobs for fast feedback
- Chromium: Extensive smoke testing before full builds
- Abseil: Symbol conflict detection in CI
Best Practices
- Fail Fast: Stop early if critical checks fail
- Layered Testing: Quick checks first, expensive checks later
- Clear Feedback: Good error messages with actionable advice
- Caching: Reuse work across jobs when possible
Appendix A: New CI Jobs (YAML)
Config Validation Job
config-validation:
  name: "Config Validation - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  strategy:
    fail-fast: true
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - name: "macOS 14"
          os: macos-14
          preset: ci-macos
          platform: macos
        - name: "Windows 2022"
          os: windows-2022
          preset: ci-windows
          platform: windows
  steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        submodules: recursive
    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}
    - name: Validate CMake configuration
      run: cmake --preset ${{ matrix.preset }}
    - name: Check configuration
      shell: bash
      run: |
        # Check include paths
        grep "INCLUDE_DIRECTORIES" build/CMakeCache.txt
        # Check preset is valid
        cmake --preset ${{ matrix.preset }} --list-presets
Compile Check Job
compile-check:
  name: "Compile Check - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  needs: [config-validation]
  strategy:
    fail-fast: true
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          preset: ci-linux
          platform: linux
        - name: "macOS 14"
          os: macos-14
          preset: ci-macos
          platform: macos
        - name: "Windows 2022"
          os: windows-2022
          preset: ci-windows
          platform: windows
  steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        submodules: recursive
    - name: Setup build environment
      uses: ./.github/actions/setup-build
      with:
        platform: ${{ matrix.platform }}
        preset: ${{ matrix.preset }}
    - name: Configure project
      run: cmake --preset ${{ matrix.preset }}
    - name: Smoke compilation test
      shell: bash
      run: ./scripts/pre-push-test.sh --smoke-only --preset ${{ matrix.preset }}
Symbol Check Job
symbol-check:
  name: "Symbol Check - ${{ matrix.name }}"
  runs-on: ${{ matrix.os }}
  needs: [build]
  strategy:
    matrix:
      include:
        - name: "Ubuntu 22.04"
          os: ubuntu-22.04
          platform: linux
        - name: "macOS 14"
          os: macos-14
          platform: macos
  steps:
    - name: Checkout code
      uses: actions/checkout@v4
    - name: Download build artifacts
      uses: actions/download-artifact@v4
      with:
        name: build-${{ matrix.platform }}
        path: build
    - name: Check for symbol conflicts
      shell: bash
      run: ./scripts/verify-symbols.sh --build-dir build
    - name: Upload conflict report
      if: failure()
      uses: actions/upload-artifact@v4
      with:
        name: symbol-conflicts-${{ matrix.platform }}
        path: build/symbol-report.txt
Appendix B: Cost Analysis
Current Monthly CI Usage (Estimated)
| Job | Duration | Runs/Month | Total Time |
|---|---|---|---|
| build (3 platforms) | 15 min × 3 | 100 PRs | 75 hours |
| test (3 platforms) | 5 min × 3 | 100 PRs | 25 hours |
| windows-agent | 30 min | 30 | 15 hours |
| code-quality | 2 min | 100 PRs | 3.3 hours |
| memory-sanitizer | 20 min | 50 PRs | 16.7 hours |
| z3ed-agent-test | 15 min | 30 | 7.5 hours |
| Total | | | 142.5 hours |
Proposed Monthly CI Usage
| Job | Duration | Runs/Month | Total Time |
|---|---|---|---|
| config-validation (3) | 2 min × 3 | 100 PRs | 10 hours |
| compile-check (3) | 5 min × 3 | 100 PRs | 25 hours |
| build (3 platforms) | 15 min × 3 | 80 PRs | 60 hours (↓20%) |
| test (3 platforms) | 5 min × 3 | 80 PRs | 20 hours (↓20%) |
| symbol-check (2) | 3 min × 2 | 80 PRs | 8 hours |
| windows-agent | 30 min | 25 | 12.5 hours (↓17%) |
| code-quality | 2 min | 100 PRs | 3.3 hours |
| memory-sanitizer | 20 min | 40 PRs | 13.3 hours (↓20%) |
| z3ed-agent-test | 15 min | 25 | 6.25 hours (↓17%) |
| Total | | | 158.4 hours (+11%) |
Net Change: +16 hours/month (11% increase)
BUT:
- Fewer failed builds (20% reduction)
- Faster feedback (10-15 min saved per failure)
- Better developer experience (invaluable)
Conclusion: Slight increase in total CI time, but significant improvement in efficiency and developer experience
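The totals in both tables can be reproduced from the per-job figures. The sketch below is illustrative only: `ci_hours`, `current_rows`, and `proposed_rows` are hypothetical names, and each row restates a table entry as minutes per run, number of platforms, and runs per month.

```shell
# Hypothetical helper: sum minutes x platforms x runs over all rows and print hours.
ci_hours() {
  awk '{ total += $1 * $2 * $3 } END { printf "%.1f\n", total / 60 }'
}

# Rows: minutes-per-run  platforms  runs-per-month  job-name (job name is ignored by awk)
current_rows() {
  printf '%s\n' \
    '15 3 100 build' '5 3 100 test' '30 1 30 windows-agent' \
    '2 1 100 code-quality' '20 1 50 memory-sanitizer' '15 1 30 z3ed-agent-test'
}

proposed_rows() {
  printf '%s\n' \
    '2 3 100 config-validation' '5 3 100 compile-check' '15 3 80 build' \
    '5 3 80 test' '3 2 80 symbol-check' '30 1 25 windows-agent' \
    '2 1 100 code-quality' '20 1 40 memory-sanitizer' '15 1 25 z3ed-agent-test'
}

# Usage:
#   current_rows | ci_hours    # prints 142.5
#   proposed_rows | ci_hours   # prints 158.4
```

The difference of roughly 16 hours corresponds to the 11% increase stated above.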