12 KiB
Emulator Core Improvements Roadmap
Last Updated: October 10, 2025
Status: Active Planning
Overview
This document outlines improvements, refactors, and optimizations for the yaze emulator core. These changes aim to enhance accuracy, performance, and code maintainability.
Items are presented in order of descending priority, from critical accuracy fixes to quality-of-life improvements.
Critical Priority: APU Timing Fix
Problem Statement
The emulator's Audio Processing Unit (APU) currently fails to load and play music. Analysis shows that the SPC700 processor gets "stuck" during the initial handshake sequence with the main CPU. This handshake is responsible for uploading the sound driver from ROM to APU RAM. The failure of this timing-sensitive process prevents the sound driver from running.
Root Cause: CPU-APU Handshake Timing
The process of starting the APU and loading a sound bank requires tightly synchronized communication between the main CPU (65816) and the APU's CPU (SPC700).
The Handshake Protocol
- APU Ready: SPC700 boots, initializes, signals ready by writing
$AAto port$F4and$BBto port$F5 - CPU Waits: Main CPU waits in tight loop, reading combined 16-bit value from I/O ports until it sees
$BBAA - CPU Initiates: CPU writes command code
$CCto APU's input port - APU Acknowledges: SPC700 sees
$CCand prepares to receive data block - Synchronized Byte Transfer: CPU and APU enter lock-step loop to transfer sound driver byte-by-byte:
- CPU sends data
- CPU waits for APU to read data and echo back confirmation
- Only upon receiving confirmation does CPU send next byte
Point of Failure
The "stuck" behavior occurs because one side fails to meet the other's expectation. Due to timing desynchronization:
- The SPC700 is waiting for a byte that the CPU has not yet sent (or sent too early), OR
- The CPU is waiting for an acknowledgment that the SPC700 has already sent (or has not yet sent)
The result is an infinite loop on the SPC700, detected by the watchdog timer in Apu::RunCycles.
Technical Analysis
The handshake's reliance on precise timing exposes inaccuracies in the current SPC700 emulation model.
Issue 1: Incomplete Opcode Timing
The emulator uses a static lookup table (spc700_cycles.h) for instruction cycle counts. This provides a base value but fails to account for:
- Addressing Modes: Different addressing modes have different cycle costs
- Page Boundaries: Memory accesses crossing 256-byte page boundaries take an extra cycle
- Branching: Conditional branches take different cycle counts depending on whether branch is taken
While some of this is handled (e.g., DoBranch), it is not applied universally, leading to small, cumulative errors.
Issue 2: Fragile Multi-Step Execution Model
The step/bstep mechanism in Spc700::RunOpcode is a significant source of fragility. It attempts to model complex instructions by spreading execution across multiple calls. This means the full cycle cost of an instruction is not consumed atomically. An off-by-one error in any step corrupts the timing of the entire APU.
Issue 3: Floating-Point Precision
The use of double for the apuCyclesPerMaster ratio can introduce minute floating-point precision errors. Over thousands of cycles required for the handshake, these small errors accumulate and contribute to timing drift between CPU and APU.
Proposed Solution: Cycle-Accurate Refactoring
Step 1: Implement Cycle-Accurate Instruction Execution
The Spc700::RunOpcode function must be refactored to calculate and consume the exact cycle count for each instruction before execution.
- Calculate Exact Cost: Before running an opcode, determine its precise cycle cost by analyzing opcode, addressing mode, and potential page-boundary penalties
- Atomic Execution: Remove the
bstepmechanism. An instruction, no matter how complex, should be fully executed within a single call to a newSpc700::Step()function
Step 2: Centralize the APU Execution Loop
The main Apu::RunCycles loop should be the sole driver of APU time.
- Cycle Budget: At the start of a frame, calculate the total "budget" of APU cycles needed
- Cycle-by-Cycle Stepping: Loop, calling
Spc700::Step()andDsp::Cycle(), decrementing cycle budget until exhausted
Example of the new loop in Apu::RunCycles:
void Apu::RunCycles(uint64_t master_cycles) {
// 1. Calculate cycle budget for this frame
const uint64_t target_apu_cycles = ...;
// 2. Run the APU until the budget is met
while (cycles_ < target_apu_cycles) {
// 3. Execute one SPC700 cycle/instruction and get its true cost
int spc_cycles_consumed = spc700_.Step();
// 4. Advance DSP and Timers for each cycle consumed
for (int i = 0; i < spc_cycles_consumed; ++i) {
Cycle(); // This ticks the DSP and timers
}
}
}
Step 3: Use Integer-Based Cycle Ratios
To eliminate floating-point errors, convert the apuCyclesPerMaster ratio to a fixed-point integer ratio. This provides perfect, drift-free conversion between main CPU and APU cycles over long periods.
High Priority: Core Architecture & Timing Model
CPU Cycle Counting
- Issue: The main CPU loop in
Snes::RunCycle()advances the master cycle counter by a fixed amount (+= 2). Real 65816 instructions have variable cycle counts. The current workaround of scatteringcallbacks_.idle()calls is error-prone and difficult to maintain. - Recommendation: Refactor
Cpu::ExecuteInstructionto calculate and return the precise cycle cost of each instruction, including penalties for addressing modes and memory access speeds. The mainSnesloop should then consume this exact value, centralizing timing logic and dramatically improving accuracy.
Main Synchronization Loop
- Issue: The main loop in
Snes::RunFrame()is state-driven based on thein_vblank_flag. This can be fragile and makes it difficult to reason about component state at any given cycle. - Recommendation: Transition to a unified main loop driven by a single master cycle counter. In this model, each component (CPU, PPU, APU, DMA) is "ticked" forward based on the master clock. This is a more robust and modular architecture that simplifies component synchronization.
Medium Priority: PPU Performance
Rendering Approach Optimization
- Issue: The PPU currently uses a "pixel-based" renderer (
Ppu::RunLinecallsHandlePixelfor every pixel). This is highly accurate but can be slow due to high function call overhead and poor cache locality. - Optimization: Refactor the PPU to use a scanline-based renderer. Instead of processing one pixel at a time, process all active layers for an entire horizontal scanline, compose them into a temporary buffer, and then write the completed scanline to the framebuffer. This is a major architectural change but is a standard and highly effective optimization technique in SNES emulation.
Benefits:
- Reduced function call overhead
- Better cache locality
- Easier to vectorize/SIMD
- Standard approach in accurate SNES emulators
Low Priority: Code Quality & Refinements
APU Code Modernization
- Issue: The code in
dsp.ccandspc700.cc, inherited from other projects, is written in a very C-like style, using raw pointers,memset, and numerous "magic numbers." - Refactor: Gradually refactor this code to use modern C++ idioms:
- Replace raw arrays with
std::array - Use constructors with member initializers instead of
memset - Define
constexprvariables orenum classtypes for hardware registers and flags - Improve type safety, readability, and long-term maintainability
- Replace raw arrays with
Audio Subsystem & Buffering
- Issue: The current implementation in
Emulator::Runqueues audio samples directly to the SDL audio device. If the emulator lags for even a few frames, the audio buffer can underrun, causing audible pops and stutters. - Improvement: Implement a lock-free ring buffer (or circular buffer) to act as an intermediary. The emulator thread would continuously write generated samples into this buffer, while the audio device (in its own thread) would continuously read from it. This decouples the emulation speed from the audio hardware, smoothing out performance fluctuations and preventing stutter.
Debugger & Tooling Optimizations
DisassemblyViewer Data Structure
- Issue:
DisassemblyVieweruses astd::mapto store instruction traces. For a tool that handles frequent insertions and lookups, this can be suboptimal. - Optimization: Replace
std::mapwithstd::unordered_mapfor faster average-case performance.
BreakpointManager Lookups
- Issue: The
ShouldBreakOn...functions perform a linear scan over astd::vectorof all breakpoints. This is O(n) and could become a minor bottleneck if a very large number of breakpoints are set. - Optimization: For execution breakpoints, use a
std::unordered_set<uint32_t>for O(1) average lookup time. This would make breakpoint checking near-instantaneous, regardless of how many are active.
Completed Improvements
Audio System Fixes (v0.4.0)
Problem Statement
The SNES emulator experienced audio glitchiness and skips, particularly during the ALTTP title screen, with audible pops, crackling, and sample skipping during music playback.
Root Causes Fixed
- Aggressive Sample Dropping: Audio buffering logic was dropping up to 50% of generated samples, creating discontinuities
- Incorrect Resampling: Duplicate calculations in linear interpolation wasted CPU cycles
- Missing Frame Synchronization: DSP's
NewFrame()method was never called, causing timing drift - Missing Hermite Interpolation: Only Linear/Cosine/Cubic were available (Hermite is the industry standard)
Solutions Implemented
- Never Drop Samples: Always queue all generated samples unless buffer critically full (>4 frames)
- Fixed Resampling Code: Removed duplicate calculations and added bounds checking
- Frame Boundary Synchronization: Added
dsp.NewFrame()call before sample generation - Hermite Interpolation: New interpolation type matching bsnes/Snes9x standard
Interpolation options (src/app/emu/audio/dsp.cc):
| Interpolation | Notes |
|---|---|
| Linear | Fastest; retains legacy behaviour. |
| Hermite | New default; balances quality and speed. |
| Cosine | Smoother than linear with moderate cost. |
| Cubic | Highest quality, heavier CPU cost. |
Result: Manual testing on the ALTTP title screen, overworld theme, dungeon ambience, and menu sounds no longer exhibits audible pops or skips. Continue to monitor regression tests after the APU timing refactor lands.
Implementation Priority
- Critical (v0.4.0): APU timing fix - Required for music playback
- High (v0.5.0): CPU cycle counting accuracy - Required for game compatibility
- High (v0.5.0): Main synchronization loop refactor - Foundation for accuracy
- Medium (v0.6.0): PPU scanline renderer - Performance optimization
- Low (ongoing): Code quality improvements - Technical debt reduction
Success Metrics
APU Timing Fix Success
- Music plays in all tested games
- Sound effects work correctly
- No audio glitches or stuttering
- Handshake completes within expected cycle count
Overall Emulation Accuracy
- CPU cycle accuracy within ±1 cycle per instruction
- APU synchronized within ±1 cycle with CPU
- PPU timing accurate to scanline level
- All test ROMs pass
Performance Targets
- 60 FPS on modest hardware (2015+ laptops)
- PPU optimizations provide 20%+ speedup
- Audio buffer never underruns in normal operation
Related Documentation
docs/E4-Emulator-Development-Guide.md- Implementation detailsdocs/E1-emulator-enhancement-roadmap.md- Feature roadmapdocs/E5-debugging-guide.md- Debugging techniques
Status: Active Planning
Next Steps: Begin APU timing refactoring for v0.4.0