docs: Add APU Timing and Handshake Bug Analysis & Refactoring Plan

- Introduced a new document detailing the APU timing issue that prevents music playback due to handshake failures between the CPU and APU.
- Analyzed the CPU-APU handshake process, identifying points of failure and root causes related to cycle inaccuracies in SPC700 emulation.
- Proposed a comprehensive refactoring plan to implement cycle-accurate instruction execution, centralize the APU execution loop, and use integer-based cycle ratios to eliminate floating-point errors.
- This document serves as a critical resource for developers addressing audio emulation challenges.
This commit is contained in:
scawful
2025-10-09 20:50:24 -04:00
parent 0ba767d6fe
commit 51342b02e3
4 changed files with 185 additions and 3 deletions

View File

@@ -0,0 +1,98 @@
# APU Timing and Handshake Bug Analysis & Refactoring Plan
## 1. Problem Statement
The emulator's Audio Processing Unit (APU) currently fails to load and play music. Analysis of the execution flow indicates that the SPC700 processor gets "stuck" during the initial handshake sequence with the main CPU. This handshake is responsible for uploading the sound driver from the ROM to the APU's RAM. The failure of this timing-sensitive process prevents the sound driver from ever running, thus no music is played.
This document outlines the cause of this timing failure and proposes a refactoring plan to achieve cycle-accurate emulation required for a stable APU.
## 2. Analysis of the CPU-APU Handshake
The process of starting the APU and loading a sound bank is not a simple data transfer; it is a tightly synchronized conversation between the main CPU (65816) and the APU's CPU (SPC700).
### 2.1. The Conversation Protocol
The main CPU, executing the `LoadSongBank` routine (`#_008888` in `bank_00.asm`), and the SPC700, executing its internal IPL ROM (`bootRom` in `apu.cc`), follow a strict protocol:
1. **APU Ready**: The SPC700 boots, initializes itself, and signals it is ready by writing `$AA` to its port `$F4` and `$BB` to port `$F5`.
2. **CPU Waits**: The main CPU waits in a tight loop, reading the combined 16-bit value from its I/O ports until it sees `$BBAA`. This confirms the APU is alive.
3. **CPU Initiates**: The CPU writes the command code `$CC` to the APU's input port.
4. **APU Acknowledges**: The SPC700, which was waiting for this command, sees `$CC` and prepares to receive a block of data.
5. **Synchronized Byte Transfer**: The CPU and APU then enter a lock-step loop to transfer the sound driver byte-by-byte. For each byte:
* The CPU sends the data.
* The CPU waits for the APU to read the data and echo back a confirmation value.
* Only upon receiving the confirmation does the CPU send the next byte.
### 2.2. Point of Failure
The "stuck" behavior occurs because one side of this conversation fails to meet the other's expectation. Due to timing desynchronization, either:
* The SPC700 is waiting for a byte that the CPU has not yet sent (or sent too early).
* The CPU is waiting for an acknowledgment that the SPC700 has already sent (or has not yet sent).
The result is an infinite loop on the SPC700, which is what the watchdog timer in `Apu::RunCycles` detects.
## 3. Root Cause: Cycle Inaccuracy in SPC700 Emulation
The handshake's reliance on precise timing exposes inaccuracies in the current SPC700 emulation model. The core issue is that the emulator does not calculate the *exact* number of cycles each SPC700 instruction consumes.
### 3.1. Incomplete Opcode Timing
The emulator uses a static lookup table (`spc700_cycles.h`) for instruction cycle counts. This provides a *base* value but fails to account for critical variations:
* **Addressing Modes**: Different addressing modes have different cycle costs.
* **Page Boundaries**: Memory accesses that cross a 256-byte page boundary take an extra cycle.
* **Branching**: Conditional branches take a different number of cycles depending on whether the branch is taken or not.
While some of this is handled (e.g., `DoBranch`), it is not applied universally, leading to small, cumulative errors.
### 3.2. Fragile Multi-Step Execution Model
The `step`/`bstep` mechanism in `Spc700::RunOpcode` is a significant source of fragility. It attempts to model complex instructions by spreading their execution across multiple calls. This means the full cycle cost of an instruction is not consumed atomically. An off-by-one error in any step, or an incorrect transition, can corrupt the timing of the entire APU, causing the handshake to fail.
### 3.3. Floating-Point Precision
The use of a `double` for the `apuCyclesPerMaster` ratio can introduce minute floating-point precision errors. Over the thousands of cycles required for the handshake, these small errors can accumulate and contribute to the overall timing drift between the CPU and APU.
## 4. Proposed Refactoring Plan
To resolve this, the APU emulation must be refactored from its current instruction-based timing model to a more robust **cycle-accurate model**. This is the standard approach for emulating timing-sensitive hardware.
### Step 1: Implement Cycle-Accurate Instruction Execution
The `Spc700::RunOpcode` function must be refactored to calculate and consume the *exact* cycle count for each instruction *before* execution.
* **Calculate Exact Cost**: Before running an opcode, determine its precise cycle cost by analyzing its opcode, addressing mode, and potential page-boundary penalties. The `spc700_cycles.h` table should be used as a base, with additional cycles added as needed.
* **Atomic Execution**: The `bstep` mechanism should be removed. An instruction, no matter how complex, should be fully executed within a single call to a new `Spc700::Step()` function. This function will be responsible for consuming the exact number of cycles it calculated.
### Step 2: Centralize the APU Execution Loop
The main `Apu::RunCycles` loop should be the sole driver of APU time.
* **Cycle Budget**: At the start of a frame, `Apu::RunCycles` will calculate the total "budget" of APU cycles it needs to execute.
* **Cycle-by-Cycle Stepping**: It will then loop, calling `Spc700::Step()` and `Dsp::Cycle()` and decrementing its cycle budget, until the budget is exhausted. This ensures the SPC700 and DSP are always perfectly synchronized.
**Example of the new loop in `Apu::RunCycles`:**
```cpp
void Apu::RunCycles(uint64_t master_cycles) {
// 1. Calculate cycle budget for this frame
const uint64_t target_apu_cycles = ...;
// 2. Run the APU until the budget is met
while (cycles_ < target_apu_cycles) {
// 3. Execute one SPC700 cycle/instruction and get its true cost
int spc_cycles_consumed = spc700_.Step();
// 4. Advance DSP and Timers for each cycle consumed
for (int i = 0; i < spc_cycles_consumed; ++i) {
Cycle(); // This ticks the DSP and timers
}
}
}
```
### Step 3: Use Integer-Based Cycle Ratios
To eliminate floating-point errors, the `apuCyclesPerMaster` ratio should be converted to a fixed-point integer ratio. This provides perfect, drift-free conversion between main CPU and APU cycles over long periods.
## 5. Conclusion
The APU handshake failure is a classic and challenging emulation problem that stems directly from timing inaccuracies. By refactoring the SPC700 execution loop to be cycle-accurate, we can ensure the emulated CPU and APU remain perfectly synchronized, allowing the handshake to complete successfully and enabling music playback.

View File

@@ -0,0 +1,59 @@
# Emulator Core Improvements & Optimization Roadmap
## 1. Introduction
This document outlines potential long-term improvements, refactors, and optimizations for the `yaze` emulator core. These suggestions aim to enhance accuracy, performance, and code maintainability.
For a detailed analysis of the critical, high-priority bug affecting audio, please first refer to the separate document:
- **[APU Timing and Handshake Bug Analysis & Refactoring Plan](./APU_Timing_Fix_Plan.md)**
The items below are presented in order of descending priority, from critical accuracy fixes to quality-of-life code improvements.
## 2. Core Architecture & Timing Model (High Priority)
The most significant improvements relate to the fundamental emulation loop and cycle timing, which are the root cause of the current audio bug and affect overall system stability.
* **CPU Cycle Counting:**
* **Issue:** The main CPU loop in `Snes::RunCycle()` advances the master cycle counter by a fixed amount (`+= 2`). Real 65816 instructions have variable cycle counts. The current workaround of scattering `callbacks_.idle()` calls is error-prone and difficult to maintain.
* **Recommendation:** Refactor `Cpu::ExecuteInstruction` to calculate and return the *precise* cycle cost of each instruction, including penalties for addressing modes and memory access speeds. The main `Snes` loop should then consume this exact value, centralizing timing logic and dramatically improving accuracy.
* **Main Synchronization Loop:**
* **Issue:** The main loop in `Snes::RunFrame()` is state-driven based on the `in_vblank_` flag. This can be fragile and makes it difficult to reason about the state of all components at any given cycle.
* **Recommendation:** Transition to a unified main loop that is driven by a single master cycle counter. In this model, each component (CPU, PPU, APU, DMA) is "ticked" forward based on the master clock. This is a more robust and modular architecture that simplifies component synchronization.
## 3. PPU (Video Rendering) Performance
The Picture Processing Unit (PPU) is often a performance bottleneck. The following change could provide a significant speed boost.
* **Rendering Approach:**
* **Issue:** The PPU currently uses a "pixel-based" renderer (`Ppu::RunLine` calls `HandlePixel` for every pixel). This is highly accurate but can be slow due to high function call overhead and poor cache locality.
* **Optimization:** Refactor the PPU to use a **scanline-based renderer**. Instead of processing one pixel at a time, this approach processes all active layers for an entire horizontal scanline, composes them into a temporary buffer, and then writes the completed scanline to the framebuffer. This is a major architectural change but is a standard and highly effective optimization technique in SNES emulation.
## 4. APU (Audio) Code Quality & Refinements
Beyond the critical timing fixes, the APU core would benefit from modernization.
* **Code Style:**
* **Issue:** The code in `dsp.cc` and `spc700.cc`, inherited from other projects, is written in a very C-like style, using raw pointers, `memset`, and numerous "magic numbers."
* **Refactor:** Gradually refactor this code to use modern C++ idioms. Replace raw arrays with `std::array`, use constructors with member initializers instead of `memset`, and define `constexpr` variables or `enum class` types for hardware registers and flags. This will improve type safety, readability, and long-term maintainability.
## 5. Audio Subsystem & Buffering
To ensure smooth, stutter-free audio, the interface between the emulator and the host audio API can be improved.
* **Audio Buffering Strategy:**
* **Issue:** The current implementation in `Emulator::Run` queues audio samples directly to the SDL audio device. If the emulator lags for even a few frames, the audio buffer can underrun, causing audible pops and stutters.
* **Improvement:** Implement a **lock-free ring buffer (or circular buffer)** to act as an intermediary. The emulator thread would continuously write generated samples into this buffer, while the audio device (in its own thread) would continuously read from it. This decouples the emulation speed from the audio hardware, smoothing out performance fluctuations and preventing stutter.
## 6. Debugger & Tooling Optimizations (Lower Priority)
These are minor optimizations that would improve the performance of the debugging tools, especially under heavy use.
* **`DisassemblyViewer` Data Structure:**
* **Issue:** `DisassemblyViewer` uses a `std::map` to store instruction traces. For a tool that handles frequent insertions and lookups, this can be suboptimal.
* **Optimization:** Replace `std::map` with `std::unordered_map` for faster average-case performance.
* **`BreakpointManager` Lookups:**
* **Issue:** The `ShouldBreakOn...` functions perform a linear scan over a `std::vector` of all breakpoints. This is O(n) and could become a minor bottleneck if a very large number of breakpoints are set.
* **Optimization:** For execution breakpoints, use a `std::unordered_set<uint32_t>` for O(1) average lookup time. This would make breakpoint checking near-instantaneous, regardless of how many are active.

View File

@@ -39,7 +39,7 @@ if(YAZE_BUILD_TESTS AND NOT YAZE_BUILD_TESTS STREQUAL "OFF")
unit/zelda3/dungeon_component_unit_test.cc
unit/zelda3/dungeon/room_object_encoding_test.cc
unit/zelda3/dungeon/room_manipulation_test.cc
unit/zelda3/dungeon_object_renderer_mock_test.cc
# dungeon_object_renderer_mock_test.cc - REMOVED (ObjectRenderer obsolete)
# CLI Services (for catalog serialization tests)
../src/cli/service/resources/resource_catalog.cc
@@ -65,7 +65,7 @@ if(YAZE_BUILD_TESTS AND NOT YAZE_BUILD_TESTS STREQUAL "OFF")
# Integration Tests (Zelda3)
integration/zelda3/overworld_integration_test.cc
integration/zelda3/dungeon_editor_system_integration_test.cc
integration/zelda3/dungeon_object_renderer_integration_test.cc
# dungeon_object_renderer_integration_test.cc - REMOVED (ObjectRenderer obsolete)
integration/zelda3/room_integration_test.cc
integration/zelda3/dungeon_object_rendering_tests.cc
integration/zelda3/dungeon_room_test.cc

View File

@@ -50,12 +50,23 @@ struct TestConfig {
bool skip_rom_tests = false;
bool enable_ui_tests = false;
bool show_gui = false;
ImGuiTestRunSpeed test_speed = ImGuiTestRunSpeed_Fast;
};
// Parse command line arguments for better AI agent testing support
TestConfig ParseArguments(int argc, char* argv[]) {
TestConfig config;
std::cout << "Available options:\n"
<< " --ui : Enable UI tests\n"
<< " --show-gui : Show GUI during tests\n"
<< " --fast : Run tests at max speed (default)\n"
<< " --normal : Run tests at watchable speed\n"
<< " --cinematic : Run tests in slow-motion with pauses\n"
<< " --rom=<path> : Specify ROM file path\n"
<< " --pattern=<pat> : Run tests matching pattern\n"
<< std::endl;
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
@@ -114,6 +125,14 @@ TestConfig ParseArguments(int argc, char* argv[]) {
config.verbose = true;
} else if (arg == "--show-gui") {
config.show_gui = true;
} else if (arg == "--fast") {
config.test_speed = ImGuiTestRunSpeed_Fast;
} else if (arg == "--normal") {
config.test_speed = ImGuiTestRunSpeed_Normal;
} else if (arg == "--cinematic") {
config.test_speed = ImGuiTestRunSpeed_Cinematic;
} else if (arg == "--ui") {
config.enable_ui_tests = true;
} else if (arg.find("--") != 0) {
// Test pattern (not a flag)
config.mode = TestMode::kSpecific;
@@ -275,9 +294,15 @@ int main(int argc, char* argv[]) {
// Setup test engine
ImGuiTestEngine* engine = ImGuiTestEngine_CreateContext();
ImGuiTestEngineIO& test_io = ImGuiTestEngine_GetIO(engine);
test_io.ConfigRunSpeed = ImGuiTestRunSpeed_Fast;
test_io.ConfigRunSpeed = config.test_speed; // Use configured speed
test_io.ConfigVerboseLevel = ImGuiTestVerboseLevel_Info;
test_io.ConfigVerboseLevelOnError = ImGuiTestVerboseLevel_Debug;
// Log test speed mode
const char* speed_name = "Fast";
if (config.test_speed == ImGuiTestRunSpeed_Normal) speed_name = "Normal";
else if (config.test_speed == ImGuiTestRunSpeed_Cinematic) speed_name = "Cinematic";
std::cout << "Running tests in " << speed_name << " mode" << std::endl;
yaze::core::Controller controller;