backend-infra-engineer: Release v0.3.3 snapshot

2025-11-21 21:35:50 -05:00
parent 3d71417f62
commit 476dd1cd1c
818 changed files with 65706 additions and 35514 deletions
--- a/docs/internal/research/apu-timing-analysis.md
+++ b/docs/internal/research/apu-timing-analysis.md
@@ -0,0 +1,461 @@
+# APU Timing Fix - Technical Analysis
+
+**Branch:** `feature/apu-timing-fix`
+**Date:** October 10, 2025
+**Status:** Implemented - Core Timing Fixed (Minor Audio Glitches Remain)
+
+---
+
+## Implementation Status
+
+**Completed:**
+- Atomic `Step()` function for SPC700
+- Fixed-point cycle ratio (no floating-point drift)
+- Cycle budget model in APU
+- Removed `bstep` mechanism from instructions.cc
+- Cycle-accurate instruction implementations
+- Proper branch timing (+2 cycles when taken)
+- Dummy read/write cycles for MOV and RMW instructions
+
+**Known Issues:**
+- Some audio glitches/distortion during playback
+- Minor timing inconsistencies under investigation
+- Can be improved in future iterations
+
+**Note:** The APU now executes correctly and music plays, but audio quality can be further refined.
+
+## Problem Summary
+
+Before this fix, the APU failed to load and play music because the SPC700 got stuck during the initial CPU-APU handshake, which uploads the sound driver from ROM to APU RAM. The resulting timing desynchronization left both processors in infinite wait loops, which the watchdog timer detected.
+
+---
+
+## Current Implementation Analysis
+
+### 1. **Cycle Counting System** (`spc700.cc`)
+
+**Current Approach:**
+```cpp
+// In spc700.h line 87:
+int last_opcode_cycles_ = 0;
+
+// In RunOpcode() line 80:
+last_opcode_cycles_ = spc700_cycles[opcode];  // Static lookup
+```
+
+**Problem:** The `spc700_cycles[]` array provides BASELINE cycle counts only. It does NOT account for:
+- Addressing mode variations
+- Page boundary crossings (+1 cycle)
+- Branch taken vs not taken (+2 cycles if taken)
+- Memory access penalties
+
+### 2. **The `bstep` Mechanism** (`spc700.cc`)
+
+**What is `bstep`?**
+
+`bstep` is a sub-instruction step counter used to spread complex multi-step instructions across multiple calls to `RunOpcode()`.
+
+**Example from lines 1108-1115 (opcode 0xCB - MOVSY dp):**
+```cpp
+case 0xcb: {  // movsy dp
+  if (bstep == 0) {
+    adr = dp();  // Save address for bstep=1
+  }
+  if (adr == 0x00F4 && bstep == 1) {
+    LOG_DEBUG("SPC", "MOVSY writing Y=$%02X to F4 at PC=$%04X", Y, PC);
+  }
+  MOVSY(adr);  // Use saved address
+  break;
+}
+```
+
+The `MOVSY()` function internally increments `bstep` to track progress:
+- `bstep=0`: Call `dp()` to get address
+- `bstep=1`: Actually perform the write
+- `bstep=2`: Reset to 0, instruction complete
+
+**Why this is fragile:**
+1. **Non-atomic execution**: An instruction takes 2-3 calls to `RunOpcode()` to complete
+2. **State leakage**: If `bstep` gets out of sync, all future instructions fail
+3. **Cycle accounting errors**: Cycles are consumed incrementally, not atomically
+4. **Debugging nightmare**: Hard to trace when an instruction "really" executes
+
+### 3. **APU Main Loop** (`apu.cc:73-143`)
+
+**Current implementation:**
+```cpp
+void Apu::RunCycles(uint64_t master_cycles) {
+  const double ratio = memory_.pal_timing() ? apuCyclesPerMasterPal : apuCyclesPerMaster;
+  uint64_t master_delta = master_cycles - g_last_master_cycles;
+  g_last_master_cycles = master_cycles;
+
+  const uint64_t target_apu_cycles = cycles_ + static_cast<uint64_t>(master_delta * ratio);
+
+  while (cycles_ < target_apu_cycles) {
+    spc700_.RunOpcode();  // Variable cycles
+    int spc_cycles = spc700_.GetLastOpcodeCycles();
+
+    for (int i = 0; i < spc_cycles; ++i) {
+      Cycle();  // Advance DSP/timers
+    }
+  }
+}
+```
+
+**Problems:**
+1. **Floating-point `ratio`**: `apuCyclesPerMaster` is `double` (line 17), causing precision drift
+2. **Opcode-level granularity**: Advances by opcode, not by cycle
+3. **No intra-instruction timing**: DSP and timers are ticked only after `RunOpcode()` returns, so events cannot be placed on a specific cycle within an instruction
+
+### 4. **Floating-Point Precision** (`apu.cc:17`)
+
+```cpp
+static const double apuCyclesPerMaster = (32040 * 32) / (1364 * 262 * 60.0);
+```
+
+**Calculation:**
+- Numerator: 32040 * 32 = 1,025,280
+- Denominator: 1364 * 262 * 60.0 = 21,442,080
+- Result: ~0.04782 (floating point)
+
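+**Problem:** `RunCycles()` multiplies each master-cycle delta by this ratio and truncates the product to an integer, so a fraction of an APU cycle is discarded on every call, on top of any floating-point rounding. Over thousands of calls these small losses accumulate into timing drift.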
+
+---
+
+## Root Cause: Handshake Timing Failure
+
+### The Handshake Protocol
+
+1. **APU Ready**: SPC700 writes `$AA` to `$F4`, `$BB` to `$F5`
+2. **CPU Waits**: Main CPU polls for `$BBAA`
+3. **CPU Initiates**: Writes `$CC` to APU input port
+4. **APU Acknowledges**: SPC700 sees `$CC`, prepares to receive
+5. **Byte Transfer Loop**: CPU sends byte, waits for echo confirmation, sends next byte
+
+### Where It Gets Stuck
+
+The SPC700 enters an infinite loop because:
+- **SPC700 is waiting** for a byte from CPU (hasn't arrived yet)
+- **CPU is waiting** for acknowledgment from SPC700 (already sent, but missed)
+
+This happens because cycle counts are off by 1-2 cycles per instruction, an error that accumulates over the ~500-1000 instructions in the handshake.
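+
+The handshake steps above can be modelled in a few lines. This is a minimal sketch, not yaze or LakeSnes code: `Ports`, `ToyLoader`, and `Upload` are hypothetical names, the destination-address and end-of-transfer steps are left out, and the bounded poll count only stands in for the watchdog.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+#include <vector>
+
+// Hypothetical model of the four mailbox bytes in each direction.
+struct Ports {
+  uint8_t to_apu[4] = {0, 0, 0, 0};  // written by the CPU, read by the SPC700
+  uint8_t to_cpu[4] = {0, 0, 0, 0};  // written by the SPC700, read by the CPU
+};
+
+// Toy stand-in for the SPC700 loader; one Poll() is one loop iteration.
+struct ToyLoader {
+  enum class State { kAnnounce, kWaitStart, kReceive };
+  State state = State::kAnnounce;
+  uint8_t expected = 0;  // index of the next byte to accept
+  std::vector<uint8_t> ram;
+
+  void Poll(Ports& p) {
+    switch (state) {
+      case State::kAnnounce:  // step 1: signal readiness with $AA/$BB
+        p.to_cpu[0] = 0xAA;
+        p.to_cpu[1] = 0xBB;
+        state = State::kWaitStart;
+        break;
+      case State::kWaitStart:  // steps 3-4: see $CC, acknowledge it
+        if (p.to_apu[0] == 0xCC) {
+          p.to_cpu[0] = 0xCC;
+          state = State::kReceive;
+        }
+        break;
+      case State::kReceive:  // step 5: accept a byte and echo its index
+        if (p.to_apu[0] == expected) {
+          ram.push_back(p.to_apu[1]);
+          p.to_cpu[0] = expected;
+          ++expected;
+        }
+        break;
+    }
+  }
+};
+
+// CPU side. Returns false if any wait exceeds max_polls (toy watchdog).
+bool Upload(const std::vector<uint8_t>& data, int max_polls,
+            std::vector<uint8_t>* received) {
+  assert(received != nullptr);
+  Ports ports;
+  ToyLoader apu;
+  // Spin until done() holds, letting the loader run once per poll.
+  auto wait_for = [&](auto done) {
+    for (int i = 0; i < max_polls; ++i) {
+      if (done()) return true;
+      apu.Poll(ports);
+    }
+    return done();
+  };
+  if (!wait_for([&] { return ports.to_cpu[0] == 0xAA && ports.to_cpu[1] == 0xBB; }))
+    return false;
+  ports.to_apu[0] = 0xCC;
+  if (!wait_for([&] { return ports.to_cpu[0] == 0xCC; })) return false;
+  uint8_t index = 0;
+  for (uint8_t byte : data) {
+    ports.to_apu[1] = byte;   // data first,
+    ports.to_apu[0] = index;  // then the index that tells the loader to read it
+    if (!wait_for([&] { return ports.to_cpu[0] == index; })) return false;
+    ++index;
+  }
+  *received = apu.ram;
+  return true;
+}
+```
+
+If the loader never gets to run (`max_polls` of 0), `Upload` returns `false`. That is the toy analogue of the mutual wait described above: each side is blocked on a port value the other side has not had the cycles to produce.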
+
+---
+
+## LakeSnes Comparison Analysis
+
+### What LakeSnes Does Right
+
+**1. Atomic Instruction Execution (spc.c:73-93)**
+```c
+void spc_runOpcode(Spc* spc) {
+  if(spc->resetWanted) { /* handle reset */ return; }
+  if(spc->stopped) { spc_idleWait(spc); return; }
+
+  uint8_t opcode = spc_readOpcode(spc);
+  spc_doOpcode(spc, opcode);  // COMPLETE instruction in one call
+}
+```
+
+**Key insight:** LakeSnes executes instructions **atomically** - no `bstep`, no `step`, no state leakage.
+
+**2. Cycle Tracking via Callbacks (spc.c:406-409)**
+```c
+static void spc_movsy(Spc* spc, uint16_t adr) {
+  spc_read(spc, adr);          // Calls apu_cycle()
+  spc_write(spc, adr, spc->y); // Calls apu_cycle()
+}
+```
+
+Every `spc_read()`, `spc_write()`, and `spc_idle()` call triggers `apu_cycle()`, which:
+- Advances APU cycle counter
+- Ticks DSP every 32 cycles
+- Updates timers
+
+**3. Simple Addressing Mode Functions (spc.c:189-275)**
+```c
+static uint16_t spc_adrDp(Spc* spc) {
+  return spc_readOpcode(spc) | (spc->p << 8);
+}
+
+static uint16_t spc_adrDpx(Spc* spc) {
+  uint16_t res = ((spc_readOpcode(spc) + spc->x) & 0xff) | (spc->p << 8);
+  spc_idle(spc);  // Extra cycle for indexed addressing
+  return res;
+}
+```
+
+Each memory access and idle call automatically advances cycles.
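+
+To make the callback-driven model concrete, here is a small self-contained sketch in C++. `ToySpc` and its members are hypothetical and model only opcode 0xCB; the point is that the cycle count falls out of the bus accesses (opcode fetch, operand fetch, dummy read, write) instead of a lookup table.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstdint>
+
+// Hypothetical miniature core: every bus access costs exactly one cycle.
+struct ToySpc {
+  std::array<uint8_t, 0x10000> ram{};
+  uint16_t pc = 0;
+  uint8_t y = 0;
+  bool p = false;  // direct-page select flag
+  uint64_t cycles = 0;
+
+  uint8_t Read(uint16_t adr) {
+    ++cycles;
+    return ram[adr];
+  }
+  void Write(uint16_t adr, uint8_t value) {
+    ++cycles;
+    ram[adr] = value;
+  }
+  uint8_t ReadOpcode() { return Read(pc++); }
+  uint16_t AdrDp() {
+    return static_cast<uint16_t>(ReadOpcode() | (p ? 0x100 : 0));
+  }
+
+  // Runs one complete instruction and returns the cycles it consumed.
+  int Step() {
+    const uint64_t start = cycles;
+    const uint8_t opcode = ReadOpcode();
+    switch (opcode) {
+      case 0xCB: {  // MOV dp, Y
+        const uint16_t adr = AdrDp();
+        Read(adr);      // dummy read before the store
+        Write(adr, y);  // actual store
+        break;
+      }
+      default:
+        assert(false && "opcode not modelled in this sketch");
+    }
+    return static_cast<int>(cycles - start);
+  }
+};
+```
+
+With `0xCB 0x10` at address 0 and `y = 0x42`, a single `Step()` call performs four bus accesses, stores `0x42` at `$0010`, and leaves `pc` at 2.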
+
+**4. APU Main Loop (apu.c:73-82)**
+```c
+int apu_runCycles(Apu* apu, int wantedCycles) {
+  int runCycles = 0;
+  uint32_t startCycles = apu->cycles;
+  while(runCycles < wantedCycles) {
+    spc_runOpcode(apu->spc);
+    runCycles += (uint32_t) (apu->cycles - startCycles);
+    startCycles = apu->cycles;
+  }
+  return runCycles;
+}
+```
+
+**Note:** This approach tracks cycles by **delta**, which works because every memory access calls `apu_cycle()`.
+
+### Where LakeSnes Falls Short (And How We Can Do Better)
+
+**1. No Explicit Cycle Return**
+- LakeSnes relies on tracking `cycles` delta after each opcode
+- Doesn't return precise cycle count from `spc_runOpcode()`
+- Makes it hard to validate cycle accuracy per instruction
+
+**Our improvement:** Return exact cycle count from `Step()`:
+```cpp
+int Spc700::Step() {
+  uint8_t opcode = ReadOpcode();
+  int cycles = CalculatePreciseCycles(opcode);
+  ExecuteInstructionAtomic(opcode);
+  return cycles;  // EXPLICIT return
+}
+```
+
+**2. Implicit Cycle Counting**
+- Cycles accumulated implicitly through callbacks
+- Hard to debug when cycles are wrong
+- No way to verify cycle accuracy per instruction
+
+**Our improvement:** Explicit cycle budget model in `Apu::RunCycles()`:
+```cpp
+while (cycles_ < target_apu_cycles) {
+  int spc_cycles = spc700_.Step();  // Explicit cycle count
+  for (int i = 0; i < spc_cycles; ++i) {
+    Cycle();  // Explicit cycle advancement
+  }
+}
+```
+
+**3. No Fixed-Point Ratio**
+- LakeSnes also uses floating-point (implicitly in SNES main loop)
+- Subject to same precision drift issues
+
+**Our improvement:** Integer numerator/denominator for perfect precision.
+
+### What We're Adopting from LakeSnes
+
+- **Atomic instruction execution** - No `bstep` mechanism
+- **Simple addressing mode functions** - Return address, advance cycles via callbacks
+- **Cycle advancement per memory access** - Every read/write/idle advances cycles
+
+### What We're Improving Over LakeSnes
+
+- **Explicit cycle counting** - `Step()` returns exact cycles consumed
+- **Cycle budget model** - Clear loop with explicit cycle advancement
+- **Fixed-point ratio** - Integer arithmetic for perfect precision
+- **Testability** - Easy to verify cycle counts per instruction
+
+---
+
+## Solution Design
+
+### Phase 1: Atomic Instruction Execution
+
+**Goal:** Eliminate `bstep` mechanism entirely.
+
+**New Design:**
+```cpp
+// New function signature
+int Spc700::Step() {
+  if (reset_wanted_) { /* handle reset */ return 8; }
+  if (stopped_) { /* handle stop */ return 2; }
+
+  // Fetch opcode
+  uint8_t opcode = ReadOpcode();
+
+  // Calculate EXACT cycle cost upfront
+  int cycles = CalculatePreciseCycles(opcode);
+
+  // Execute instruction COMPLETELY
+  ExecuteInstructionAtomic(opcode);
+
+  return cycles;  // Return exact cycles consumed
+}
+```
+
+**Benefits:**
+- One call = one complete instruction
+- Cycles calculated before execution
+- No state leakage between calls
+- Easier debugging
+
+### Phase 2: Precise Cycle Calculation
+
+**New function:**
+```cpp
+int Spc700::CalculatePreciseCycles(uint8_t opcode) {
+  int base_cycles = spc700_cycles[opcode];
+
+  // Account for addressing mode penalties
+  switch (opcode) {
+    case 0x10: case 0x30: /* ... branches ... */
+      // Branches: +2 cycles if taken (handled in execution)
+      break;
+    case 0x15: case 0x16: /* ... abs+X, abs+Y ... */
+      // Check if page boundary crossed (+1 cycle)
+      if (will_cross_page_boundary(opcode)) {
+        base_cycles += 1;
+      }
+      break;
+    // ... more addressing mode checks ...
+  }
+
+  return base_cycles;
+}
+```
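+
+The branch case, which the function above defers to execution, can be sketched on its own. `BranchResult` and `ExecuteBranch` are hypothetical names, and the sketch assumes the eight flag-conditional branches (BPL, BMI, BVC, BVS, BCC, BCS, BNE, BEQ), which cost 2 cycles when not taken and 4 when taken.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+
+// Hypothetical result type: where execution continues and what it cost.
+struct BranchResult {
+  uint16_t pc;
+  int cycles;
+};
+
+// pc_after_operand is the address of the instruction following the branch.
+BranchResult ExecuteBranch(uint8_t opcode, uint16_t pc_after_operand,
+                           int8_t offset, bool taken) {
+  // Flag-conditional branches are 0x10, 0x30, ..., 0xF0.
+  assert((opcode & 0x1F) == 0x10);
+  constexpr int kBaseCycles = 2;    // opcode fetch + operand fetch
+  constexpr int kTakenPenalty = 2;  // extra cycles when the branch is taken
+  if (!taken) {
+    return {pc_after_operand, kBaseCycles};
+  }
+  // The signed offset is relative to the next instruction; wrap to 16 bits.
+  const uint16_t target = static_cast<uint16_t>(pc_after_operand + offset);
+  return {target, kBaseCycles + kTakenPenalty};
+}
+```
+
+Returning the cycle cost together with the new program counter keeps the taken/not-taken decision in one place, so the count reported by `Step()` cannot disagree with what actually executed.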
+
+### Phase 3: Refactor `Apu::RunCycles` to Cycle Budget Model
+
+**New implementation:**
+```cpp
+void Apu::RunCycles(uint64_t master_cycles) {
+  // 1. Calculate target using FIXED-POINT ratio (Phase 4)
+  uint64_t master_delta = master_cycles - g_last_master_cycles;
+  g_last_master_cycles = master_cycles;
+
+  // 2. Fixed-point conversion (avoiding floating point)
+  uint64_t target_apu_cycles = cycles_ + (master_delta * kApuCyclesNumerator) / kApuCyclesDenominator;
+
+  // 3. Run until budget exhausted
+  while (cycles_ < target_apu_cycles) {
+    // 4. Execute ONE instruction atomically
+    int spc_cycles_consumed = spc700_.Step();
+
+    // 5. Advance DSP/timers for each cycle
+    for (int i = 0; i < spc_cycles_consumed; ++i) {
+      Cycle();  // Ticks DSP, timers, increments cycles_
+    }
+  }
+}
+```
+
+### Phase 4: Fixed-Point Cycle Ratio
+
+**Replace floating-point with integer ratio:**
+```cpp
+// Old (apu.cc:17)
+static const double apuCyclesPerMaster = (32040 * 32) / (1364 * 262 * 60.0);
+
+// New
+static constexpr uint64_t kApuCyclesNumerator = 32040 * 32;      // 1,025,280
+static constexpr uint64_t kApuCyclesDenominator = 1364 * 262 * 60;  // 21,442,080
+```
+
+**Conversion:**
+```cpp
+apu_cycles = (master_cycles * kApuCyclesNumerator) / kApuCyclesDenominator;
+```
+
+**Benefits:**
+- No floating-point rounding; the conversion is exact provided the division remainder is carried over between calls
+- Integer-only arithmetic
+- Deterministic across platforms
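+
+Integer division still truncates, so a conversion applied to many small deltas loses a fraction of a cycle on each call unless the remainder is kept. A minimal sketch of one way to carry it, using the NTSC constants from this section; `MasterToApuClock` is a hypothetical helper, not the implemented code:
+
+```cpp
+#include <cassert>
+#include <cstdint>
+
+// Hypothetical converter that keeps the division remainder between calls.
+struct MasterToApuClock {
+  static constexpr uint64_t kNumerator = 32040ULL * 32;          // 1,025,280
+  static constexpr uint64_t kDenominator = 1364ULL * 262 * 60;   // 21,442,080
+  uint64_t remainder = 0;  // always < kDenominator
+
+  // Returns the whole APU cycles owed for master_delta master cycles.
+  uint64_t Convert(uint64_t master_delta) {
+    // Guard against overflow of the scaled value.
+    assert(master_delta <= (UINT64_MAX - kDenominator) / kNumerator);
+    const uint64_t scaled = master_delta * kNumerator + remainder;
+    remainder = scaled % kDenominator;  // fraction carried into the next call
+    return scaled / kDenominator;
+  }
+};
+```
+
+Feeding 1000 deltas of 1364 master cycles through `Convert` yields 65,221 APU cycles in total, the same as a single conversion of 1,364,000 master cycles. Truncating each delta independently would yield 65 per call, or 65,000, a loss of 221 cycles.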
+
+---
+
+## Implementation Plan
+
+### Step 1: Add `Spc700::Step()` Function
+- Add new `Step()` method to `spc700.h`
+- Implement atomic instruction execution
+- Keep `RunOpcode()` temporarily for compatibility
+
+### Step 2: Implement Precise Cycle Calculation
+- Create `CalculatePreciseCycles()` helper
+- Handle branch penalties
+- Handle page boundary crossings
+- Add tests to verify against known SPC700 timings
+
+### Step 3: Eliminate `bstep` Mechanism
+- Refactor all multi-step instructions (0xCB, 0xD0, 0xD7, etc.)
+- Remove `bstep` variable
+- Remove `step` variable
+- Verify all 256 opcodes work atomically
+
+### Step 4: Refactor `Apu::RunCycles`
+- Switch to cycle budget model
+- Use `Step()` instead of `RunOpcode()`
+- Add cycle budget logging for debugging
+
+### Step 5: Convert to Fixed-Point Ratio
+- Replace `apuCyclesPerMaster` double
+- Use integer numerator/denominator
+- Add constants for PAL timing too
+
+### Step 6: Testing
+- Test with vanilla Zelda3 ROM
+- Verify handshake completes
+- Verify music plays
+- Check for watchdog timeouts
+- Measure timing accuracy
+
+---
+
+## Files to Modify
+
+1. **src/app/emu/audio/spc700.h**
+   - Add `int Step()` method
+   - Add `int CalculatePreciseCycles(uint8_t opcode)`
+   - Remove `bstep` and `step` variables
+
+2. **src/app/emu/audio/spc700.cc**
+   - Implement `Step()`
+   - Implement `CalculatePreciseCycles()`
+   - Refactor `ExecuteInstructions()` to be atomic
+   - Remove all `bstep` logic
+
+3. **src/app/emu/audio/apu.h**
+   - Update cycle ratio constants
+
+4. **src/app/emu/audio/apu.cc**
+   - Refactor `RunCycles()` to use `Step()`
+   - Convert to fixed-point ratio
+   - Remove floating-point arithmetic
+
+5. **test/unit/spc700_timing_test.cc** (new)
+   - Test cycle accuracy for all opcodes
+   - Test handshake simulation
+   - Verify no regressions
+
+---
+
+## Success Criteria
+
+- [x] All SPC700 instructions execute atomically (one `Step()` call)
+- [x] Cycle counts accurate to ±1 cycle per instruction
+- [x] APU handshake completes without watchdog timeout
+- [x] Music loads and plays in vanilla Zelda3
+- [x] No floating-point drift over long emulation sessions
+- [ ] Unit tests pass for all 256 opcodes (future work)
+- [ ] Audio quality refined (minor glitches remain)
+
+---
+
+## Implementation Completed
+
+1. Create feature branch
+2. Analyze current implementation
+3. Implement `Spc700::Step()` function
+4. Add precise cycle calculation
+5. Refactor `Apu::RunCycles`
+6. Convert to fixed-point ratio
+7. Refactor instructions.cc to be atomic and cycle-accurate
+8. Test with Zelda3 ROM
+9. Write unit tests (future work)
+10. Fine-tune audio quality (future work)
+
+---
+
+**References:**
+- [SPC700 Opcode Reference](https://problemkaputt.de/fullsnes.htm#snesapucpu)
+- [APU Timing Documentation](https://wiki.superfamicom.org/spc700-reference)
+- docs/E6-emulator-improvements.md