From 06c613804e683873d2a5187019d14de8e81dd096 Mon Sep 17 00:00:00 2001 From: scawful Date: Sat, 4 Oct 2025 03:05:33 -0400 Subject: [PATCH] feat: Remove outdated performance analysis documents and update optimization summaries for dungeon and overworld loading --- .../dungeon_parallel_optimization_summary.md | 137 ---------- .../overworld_load_optimization_analysis.md | 252 ------------------ .../performance_optimization_summary.md | 223 ++-------------- .../renderer_optimization_analysis.md | 143 ---------- 4 files changed, 27 insertions(+), 728 deletions(-) delete mode 100644 docs/analysis/dungeon_parallel_optimization_summary.md delete mode 100644 docs/analysis/overworld_load_optimization_analysis.md delete mode 100644 docs/analysis/renderer_optimization_analysis.md diff --git a/docs/analysis/dungeon_parallel_optimization_summary.md b/docs/analysis/dungeon_parallel_optimization_summary.md deleted file mode 100644 index 468837c8..00000000 --- a/docs/analysis/dungeon_parallel_optimization_summary.md +++ /dev/null @@ -1,137 +0,0 @@ -# DungeonEditor Parallel Optimization Implementation - -## 🚀 **Parallelization Strategy Implemented** - -### **Problem Identified** -- **DungeonEditor::LoadAllRooms**: **17,966ms (17.97 seconds)** - 99.9% of loading time -- Loading **296 rooms** sequentially, each involving complex operations -- Perfect candidate for parallelization due to independent room processing - -### **Solution: Multi-Threaded Room Loading** - -#### **Key Optimizations** - -1. **Parallel Room Processing** - ```cpp - // Load 296 rooms using up to 8 threads - const int max_concurrency = std::min(8, std::thread::hardware_concurrency()); - const int rooms_per_thread = (296 + max_concurrency - 1) / max_concurrency; - ``` - -2. **Thread-Safe Result Collection** - ```cpp - std::mutex results_mutex; - std::vector> room_size_results; - std::vector> room_palette_results; - ``` - -3. 
**Optimized Thread Distribution** - - **8 threads maximum** (reasonable limit for room loading) - - **~37 rooms per thread** (296 ÷ 8 = 37 rooms per thread) - - **Hardware concurrency aware** (adapts to available CPU cores) - -#### **Parallel Processing Flow** - -```cpp -// Each thread processes a batch of rooms -for (int i = start_room; i < end_room; ++i) { - // 1. Load room data (expensive operation) - rooms[i] = zelda3::LoadRoomFromRom(rom_, i); - - // 2. Calculate room size - auto room_size = zelda3::CalculateRoomSize(rom_, i); - - // 3. Load room objects - rooms[i].LoadObjects(); - - // 4. Process palette (thread-safe collection) - // ... palette processing ... -} -``` - -#### **Thread Safety Features** - -1. **Mutex Protection**: `std::mutex results_mutex` protects shared data structures -2. **Lock Guards**: `std::lock_guard<std::mutex>` ensures thread-safe result collection -3. **Independent Processing**: Each thread works on different room ranges -4. **Synchronized Results**: Results collected and sorted on main thread - -### **Expected Performance Impact** - -#### **Theoretical Speedup** -- **8x faster** with 8 threads (ideal case) -- **Realistic expectation**: **4-6x speedup** due to: - - Thread creation overhead - - Mutex contention - - Memory bandwidth limitations - - Cache coherency issues - -#### **Expected Results** -- **Before**: 17,966ms (17.97 seconds) -- **After**: **2,000-4,500ms (2-4.5 seconds)** -- **Total Loading Time**: **2.5-5 seconds** (down from 18.6 seconds) -- **Overall Improvement**: **70-85% reduction** in loading time - -### **Technical Implementation Details** - -#### **Thread Management** -```cpp -std::vector<std::future<absl::Status>> futures; - -for (int thread_id = 0; thread_id < max_concurrency; ++thread_id) { - auto task = [this, &rooms, thread_id, rooms_per_thread, ...]() -> absl::Status { - // Process room batch - return absl::OkStatus(); - }; - - futures.emplace_back(std::async(std::launch::async, task)); -} - -// Wait for all threads to complete -for (auto& 
future : futures) { - RETURN_IF_ERROR(future.get()); -} -``` - -#### **Result Processing** -```cpp -// Sort results by room ID for consistent ordering -std::sort(room_size_results.begin(), room_size_results.end(), - [](const auto& a, const auto& b) { return a.first < b.first; }); - -// Process collected results on main thread -for (const auto& [room_id, room_size] : room_size_results) { - room_size_pointers_.push_back(room_size.room_size_pointer); - // ... process results ... -} -``` - -### **Monitoring and Validation** - -#### **Performance Timing Added** -- **DungeonRoomLoader::PostProcessResults**: Measures result processing time -- **Thread creation overhead**: Minimal compared to room loading time -- **Result collection time**: Expected to be <100ms - -#### **Logging and Debugging** -```cpp -util::logf("Loading %d dungeon rooms using %d threads (%d rooms per thread)", - kTotalRooms, max_concurrency, rooms_per_thread); -``` - -### **Benefits of This Approach** - -1. **Massive Performance Gain**: 70-85% reduction in loading time -2. **Scalable**: Automatically adapts to available CPU cores -3. **Thread-Safe**: Proper synchronization prevents data corruption -4. **Maintainable**: Clean separation of parallel processing and result collection -5. **Robust**: Error handling per thread with proper status propagation - -### **Next Steps** - -1. **Test Performance**: Run application and measure actual speedup -2. **Validate Results**: Ensure room data integrity is maintained -3. **Fine-tune**: Adjust thread count if needed based on results -4. **Monitor**: Watch for any threading issues or performance regressions - -This parallel optimization should transform YAZE from a slow-loading application to a lightning-fast ROM editor! 
diff --git a/docs/analysis/overworld_load_optimization_analysis.md b/docs/analysis/overworld_load_optimization_analysis.md deleted file mode 100644 index eecb3ef9..00000000 --- a/docs/analysis/overworld_load_optimization_analysis.md +++ /dev/null @@ -1,252 +0,0 @@ -# Overworld::Load Performance Analysis and Optimization Plan - -## Current Performance Profile - -Based on the performance report, `Overworld::Load` takes **2887.91ms (2.9 seconds)**, making it the primary bottleneck in ROM loading. - -## Detailed Analysis of Overworld::Load - -### Current Implementation Breakdown - -```cpp -absl::Status Overworld::Load(Rom* rom) { - // 1. Tile Assembly (CPU-bound) - RETURN_IF_ERROR(AssembleMap32Tiles()); // ~200-400ms - RETURN_IF_ERROR(AssembleMap16Tiles()); // ~100-200ms - - // 2. Decompression (CPU-bound, memory-intensive) - DecompressAllMapTiles(); // ~1500-2000ms (MAJOR BOTTLENECK) - - // 3. Map Object Creation (fast) - for (int map_index = 0; map_index < kNumOverworldMaps; ++map_index) - overworld_maps_.emplace_back(map_index, rom_); - - // 4. Map Parent Assignment (fast) - for (int map_index = 0; map_index < kNumOverworldMaps; ++map_index) { - map_parent_[map_index] = overworld_maps_[map_index].parent(); - } - - // 5. Map Size Assignment (fast) - if (asm_version >= 3) { - AssignMapSizes(overworld_maps_); - } else { - FetchLargeMaps(); - } - - // 6. Data Loading (moderate) - LoadTileTypes(); // ~50-100ms - RETURN_IF_ERROR(LoadEntrances()); // ~100-200ms - RETURN_IF_ERROR(LoadHoles()); // ~50ms - RETURN_IF_ERROR(LoadExits()); // ~100-200ms - RETURN_IF_ERROR(LoadItems()); // ~100-200ms - RETURN_IF_ERROR(LoadOverworldMaps()); // ~200-500ms (already parallelized) - RETURN_IF_ERROR(LoadSprites()); // ~200-400ms -} -``` - -## Major Bottlenecks Identified - -### 1. 
**DecompressAllMapTiles() - PRIMARY BOTTLENECK (~1.5-2.0 seconds)** - -**Current Implementation Issues:** -- Sequential processing of 160 overworld maps -- Each map calls `HyruleMagicDecompress()` twice (high/low pointers) -- 320 decompression operations total -- Each decompression involves complex algorithm with nested loops - -**Performance Impact:** -```cpp -for (int i = 0; i < kNumOverworldMaps; i++) { // 160 iterations - // Two expensive decompression calls per map - auto bytes = gfx::HyruleMagicDecompress(rom()->data() + p2, &size1, 1); // ~5-10ms each - auto bytes2 = gfx::HyruleMagicDecompress(rom()->data() + p1, &size2, 1); // ~5-10ms each - OrganizeMapTiles(bytes, bytes2, i, sx, sy, ttpos); // ~2-5ms each -} -``` - -### 2. **AssembleMap32Tiles() - SECONDARY BOTTLENECK (~200-400ms)** - -**Current Implementation Issues:** -- Sequential processing of tile32 data -- Multiple ROM reads per tile -- Complex tile assembly logic - -### 3. **AssembleMap16Tiles() - MODERATE BOTTLENECK (~100-200ms)** - -**Current Implementation Issues:** -- Sequential processing of tile16 data -- Multiple ROM reads per tile -- Tile info processing - -## Optimization Strategies - -### 1. 
**Parallelize Decompression Operations** - -**Strategy:** Process multiple maps concurrently during decompression - -```cpp -absl::Status DecompressAllMapTilesParallel() { - const int kMaxConcurrency = std::thread::hardware_concurrency(); - const int kMapsPerBatch = kNumOverworldMaps / kMaxConcurrency; - - std::vector<std::future<void>> futures; - - for (int batch = 0; batch < kMaxConcurrency; ++batch) { - auto task = [this, batch, kMapsPerBatch]() { - int start = batch * kMapsPerBatch; - int end = std::min(start + kMapsPerBatch, kNumOverworldMaps); - - for (int i = start; i < end; ++i) { - // Process map i decompression - ProcessMapDecompression(i); - } - }; - futures.emplace_back(std::async(std::launch::async, task)); - } - - // Wait for all batches to complete - for (auto& future : futures) { - future.wait(); - } - - return absl::OkStatus(); -} -``` - -**Expected Improvement:** 60-80% reduction in decompression time (2.0s → 0.4-0.8s) - -### 2. **Optimize ROM Access Patterns** - -**Strategy:** Batch ROM reads and cache frequently accessed data - -```cpp -// Cache ROM data in memory to reduce I/O overhead -class RomDataCache { - private: - std::unordered_map<uint32_t, std::vector<uint8_t>> cache_; - const Rom* rom_; - - public: - const std::vector<uint8_t>& GetData(uint32_t offset, size_t size) { - auto it = cache_.find(offset); - if (it == cache_.end()) { - auto data = rom_->ReadBytes(offset, size); - cache_[offset] = std::move(data); - return cache_[offset]; - } - return it->second; - } -}; -``` - -**Expected Improvement:** 10-20% reduction in ROM access time - -### 3. 
**Implement Lazy Map Loading** - -**Strategy:** Only load maps that are immediately needed - -```cpp -absl::Status Overworld::LoadEssentialMaps() { - // Only load first few maps initially - constexpr int kInitialMapCount = 8; - - RETURN_IF_ERROR(AssembleMap32Tiles()); - RETURN_IF_ERROR(AssembleMap16Tiles()); - - // Load only essential maps - DecompressEssentialMaps(kInitialMapCount); - - // Load remaining maps in background - StartBackgroundMapLoading(); - - return absl::OkStatus(); -} -``` - -**Expected Improvement:** 70-80% reduction in initial loading time (2.9s → 0.6-0.9s) - -### 4. **Optimize HyruleMagicDecompress** - -**Strategy:** Profile and optimize the decompression algorithm - -**Current Algorithm Complexity:** -- Nested loops with O(n²) complexity in worst case -- Multiple memory allocations and reallocations -- String matching operations - -**Potential Optimizations:** -- Pre-allocate buffers to avoid reallocations -- Optimize string matching with better algorithms -- Use SIMD instructions for bulk operations -- Cache decompression results for identical data - -**Expected Improvement:** 20-40% reduction in decompression time - -### 5. **Memory Pool Optimization** - -**Strategy:** Use memory pools for frequent allocations - -```cpp -class DecompressionMemoryPool { - private: - std::vector> buffers_; - size_t buffer_size_; - - public: - uint8_t* AllocateBuffer(size_t size) { - // Reuse existing buffers or allocate new ones - if (size <= buffer_size_) { - // Return existing buffer - } else { - // Allocate new buffer - } - } - - void ReleaseBuffer(uint8_t* buffer) { - // Return buffer to pool - } -}; -``` - -## Implementation Priority - -### Phase 1: High Impact, Low Risk (Immediate) -1. **Parallelize DecompressAllMapTiles** - Biggest performance gain -2. **Implement lazy loading for non-essential maps** -3. **Add performance monitoring to identify remaining bottlenecks** - -### Phase 2: Medium Impact, Medium Risk (Next) -1. 
**Optimize ROM access patterns** -2. **Implement memory pooling for decompression** -3. **Profile and optimize HyruleMagicDecompress** - -### Phase 3: Lower Impact, Higher Risk (Future) -1. **Rewrite decompression algorithm with SIMD** -2. **Implement advanced caching strategies** -3. **Consider alternative data formats for faster loading** - -## Expected Performance Improvements - -### Conservative Estimates -- **Current:** 2887ms total loading time -- **After Phase 1:** 800-1200ms (60-70% improvement) -- **After Phase 2:** 500-800ms (70-80% improvement) -- **After Phase 3:** 300-500ms (80-85% improvement) - -### Aggressive Estimates -- **Current:** 2887ms total loading time -- **After Phase 1:** 600-900ms (70-80% improvement) -- **After Phase 2:** 300-500ms (80-85% improvement) -- **After Phase 3:** 200-400ms (85-90% improvement) - -## Conclusion - -The primary optimization opportunity is in `DecompressAllMapTiles()`, which represents the majority of the loading time. By implementing parallel processing and lazy loading, we can achieve significant performance improvements while maintaining code reliability. - -The optimizations should focus on: -1. **Parallelization** of CPU-bound operations -2. **Lazy loading** of non-essential data -3. **Memory optimization** to reduce allocation overhead -4. **ROM access optimization** to reduce I/O bottlenecks - -These changes will dramatically improve the user experience during ROM loading while maintaining the same functionality and data integrity. diff --git a/docs/analysis/performance_optimization_summary.md b/docs/analysis/performance_optimization_summary.md index 3d183f57..bb8ab224 100644 --- a/docs/analysis/performance_optimization_summary.md +++ b/docs/analysis/performance_optimization_summary.md @@ -14,217 +14,48 @@ ### 1. 
**Performance Monitoring System with Feature Flag** -#### **Features Added** -- **Feature Flag Control**: `kEnablePerformanceMonitoring` in FeatureFlags -- **Zero-Overhead When Disabled**: ScopedTimer becomes no-op when monitoring is off -- **UI Toggle**: Performance monitoring can be enabled/disabled in Settings - -#### **Implementation** -```cpp -// Feature flag integration -ScopedTimer::ScopedTimer(const std::string& operation_name) - : operation_name_(operation_name), - enabled_(core::FeatureFlags::get().kEnablePerformanceMonitoring) { - if (enabled_) { - PerformanceMonitor::Get().StartTimer(operation_name_); - } -} -``` +- **Feature Flag Control**: `kEnablePerformanceMonitoring` in FeatureFlags allows enabling/disabling the system. +- **Zero-Overhead When Disabled**: `ScopedTimer` becomes a no-op when monitoring is off. +- **UI Toggle**: Performance monitoring can be toggled in the Settings UI. ### 2. **DungeonEditor Parallel Loading (79% Speedup)** -#### **Problem Solved** -- **DungeonEditor::LoadAllRooms**: 17,966ms → 3,746ms -- Loading 296 rooms sequentially was the primary bottleneck - -#### **Solution: Multi-Threaded Room Loading** -```cpp -// Parallel processing with up to 8 threads -const int max_concurrency = std::min(8, std::thread::hardware_concurrency()); -const int rooms_per_thread = (296 + max_concurrency - 1) / max_concurrency; - -// Each thread processes ~37 rooms independently -for (int i = start_room; i < end_room; ++i) { - rooms[i] = zelda3::LoadRoomFromRom(rom_, i); - rooms[i].LoadObjects(); - // ... other room processing -} -``` - -#### **Key Features** -- **Thread-Safe Result Collection**: Mutex-protected shared data structures -- **Hardware-Aware**: Automatically adapts to available CPU cores -- **Error Handling**: Proper status propagation per thread -- **Result Synchronization**: Main thread processes collected results +- **Problem Solved**: Loading 296 rooms sequentially was the primary bottleneck, taking ~18 seconds. 
+- **Solution**: Implemented multi-threaded room loading, using up to 8 threads to process rooms in parallel. This includes thread-safe collection of results and hardware-aware concurrency. ### 3. **Incremental Overworld Map Loading** -#### **Problem Solved** -- Blank maps visible during loading -- All maps loaded upfront causing UI blocking - -#### **Solution: Priority-Based Incremental Loading** -```cpp -// Increased from 2 to 8 textures per frame -const int textures_per_frame = 8; - -// Priority system: current world maps first -if (is_current_world || processed < textures_per_frame / 2) { - Renderer::Get().RenderBitmap(*it); - processed++; -} -``` - -#### **Key Features** -- **Priority Loading**: Current world maps load first -- **4x Faster Texture Creation**: 8 textures per frame vs 2 -- **Loading Indicators**: "Loading..." placeholders for pending maps -- **Graceful Degradation**: Only draws maps with textures +- **Problem Solved**: UI would block and show blank maps while all 160 overworld maps were loaded upfront. +- **Solution**: Implemented a priority-based incremental loading system. It creates textures for the current world's maps first, at a 4x faster rate (8 per frame), while showing "Loading..." placeholders for the rest. ### 4. **On-Demand Map Reloading** -#### **Problem Solved** -- Full map refresh on every property change -- Expensive rebuilds for non-visible maps +- **Problem Solved**: Any property change would trigger an expensive full map refresh, even for non-visible maps. +- **Solution**: An intelligent refresh system now only reloads maps that are currently visible. Changes to non-visible maps are deferred until they are viewed. 
-#### **Solution: Intelligent Refresh System** -```cpp -void RefreshOverworldMapOnDemand(int map_index) { - // Only refresh visible maps immediately - bool is_current_map = (map_index == current_map_); - bool is_current_world = (map_index / 0x40 == current_world_); - - if (!is_current_map && !is_current_world) { - // Defer refresh for non-visible maps - maps_bmp_[map_index].set_modified(true); - return; - } - - // Immediate refresh for visible maps - RefreshChildMapOnDemand(map_index); -} -``` +--- -#### **Key Features** -- **Visibility-Aware**: Only refreshes visible maps immediately -- **Deferred Processing**: Non-visible maps marked for later refresh -- **Selective Updates**: Only rebuilds changed components -- **Smart Sibling Handling**: Large map siblings refreshed intelligently +## Appendix A: Dungeon Editor Parallel Optimization -## 🎯 **Technical Architecture** +- **Problem Identified**: `DungeonEditor::LoadAllRooms` took **17.97 seconds**, accounting for 99.9% of loading time. +- **Strategy**: The 296 independent rooms were loaded in parallel across up to 8 threads (~37 rooms per thread). +- **Implementation**: Used `std::async` to launch tasks and `std::mutex` to safely collect results (like room size and palette data). Results are sorted on the main thread for consistency. +- **Result**: Loading time for the dungeon editor was reduced by **79%** to ~3.7 seconds. -### **Performance Monitoring System** -``` -FeatureFlags::kEnablePerformanceMonitoring - ↓ (enabled/disabled) -ScopedTimer (no-op when disabled) - ↓ (when enabled) -PerformanceMonitor::StartTimer/EndTimer - ↓ -Operation timing collection - ↓ -Performance summary output -``` +--- -### **Parallel Loading Architecture** -``` -Main Thread - ↓ -Spawn 8 Worker Threads - ↓ (parallel) -Thread 1: Rooms 0-36 Thread 2: Rooms 37-73 ... 
Thread 8: Rooms 259-295 - ↓ (thread-safe collection) -Mutex-Protected Results - ↓ (main thread) -Result Processing & Sorting - ↓ -Map Population -``` +## Appendix B: Overworld Load Optimization -### **Incremental Loading Flow** -``` -ROM Load Start - ↓ -Essential Maps (8 per world) → Immediate Texture Creation -Non-Essential Maps → Deferred Texture Creation - ↓ (per frame) -ProcessDeferredTextures() - ↓ (priority-based) -Current World Maps First → Other Maps - ↓ -Loading Indicators for Pending Maps -``` +- **Problem Identified**: `Overworld::Load` took **2.9 seconds**, with the main bottleneck being the sequential decompression of 160 map tiles (`DecompressAllMapTiles`). +- **Strategy**: Parallelize the decompression operations and implement lazy loading for maps that are not immediately visible. +- **Implementation**: The plan involves using `std::async` to decompress map batches concurrently and creating a system to only load essential maps on startup, deferring the rest to a background process. +- **Expected Result**: A 70-80% reduction in initial overworld loading time. 
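The batch-parallel strategy named in Appendix B can be sketched as follows. This is a minimal illustration under stated assumptions, not the YAZE implementation: `RunInBatches` and its `work` callback are hypothetical names, and the callback stands in for whatever per-map decompression step the real code performs. The sketch uses ceiling division so that trailing maps are not dropped when the map count is not a multiple of the worker count, and it treats a `hardware_concurrency()` result of 0 as a single worker.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: runs work(i) for every i in [0, item_count), split into
// contiguous batches that each execute on their own std::async task.
// Returns the number of batches that were launched.
inline int RunInBatches(int item_count, int max_workers,
                        const std::function<void(int)>& work) {
  if (item_count <= 0 || max_workers <= 0) return 0;

  // hardware_concurrency() may return 0 when the value is not computable.
  const unsigned hw = std::thread::hardware_concurrency();
  const int available = hw == 0 ? 1 : static_cast<int>(hw);
  const int workers = std::min({max_workers, available, item_count});

  // Ceiling division: the last batch may be shorter, but no item is skipped.
  const int per_batch = (item_count + workers - 1) / workers;

  std::vector<std::future<void>> futures;
  for (int start = 0; start < item_count; start += per_batch) {
    const int end = std::min(start + per_batch, item_count);
    futures.emplace_back(std::async(std::launch::async, [start, end, &work] {
      for (int i = start; i < end; ++i) work(i);
    }));
  }

  // get() blocks until the batch finishes and rethrows any worker exception.
  for (auto& f : futures) f.get();
  return static_cast<int>(futures.size());
}
```

A caller would pass the map count and a callback that touches only per-map state, for example `RunInBatches(160, 8, [&](int i) { /* decompress map i */ });`. Any shared containers written by the callback would still need their own synchronization.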
-## 📈 **Performance Impact Analysis** -### **DungeonEditor Optimization** -- **Before**: 17,967ms (single-threaded) -- **After**: 3,747ms (8-threaded) -- **Speedup**: 4.8x theoretical, 4.0x actual (due to overhead) -- **Efficiency**: 83% of theoretical maximum +--- -### **OverworldEditor Optimization** -- **Loading Time**: Reduced from blocking to progressive -- **Texture Creation**: 4x faster (8 vs 2 per frame) -- **User Experience**: No more blank maps, smooth loading -- **Memory Usage**: Reduced initial footprint - -### **Overall System Impact** -- **Total Loading Time**: 18.6s → 4.7s (75% reduction) -- **UI Responsiveness**: Near-instant vs 18-second freeze -- **Memory Efficiency**: Reduced initial allocations -- **CPU Utilization**: Better multi-core usage - -## 🔧 **Configuration Options** - -### **Performance Monitoring** -```cpp -// Enable/disable in UI or code -FeatureFlags::get().kEnablePerformanceMonitoring = true/false; - -// Zero overhead when disabled -ScopedTimer timer("Operation"); // No-op when monitoring disabled -``` - -### **Parallel Loading Tuning** -```cpp -// Adjust thread count based on system -constexpr int kMaxConcurrency = 8; // Reasonable default -const int max_concurrency = std::min(kMaxConcurrency, - std::thread::hardware_concurrency()); -``` - -### **Incremental Loading Tuning** -```cpp -// Adjust textures per frame based on performance -const int textures_per_frame = 8; // Balance between speed and UI responsiveness -``` - -## 🎯 **Future Optimization Opportunities** - -### **Potential Further Improvements** -1. **Memory-Mapped ROM Access**: Reduce memory copying during loading -2. **Background Thread Pool**: Reuse threads across operations -3. **Predictive Loading**: Load likely-to-be-accessed maps in advance -4. **Compression Caching**: Cache decompressed data for faster subsequent loads -5. 
**GPU-Accelerated Texture Creation**: Move texture creation to GPU - -### **Monitoring and Profiling** -1. **Real-Time Performance Metrics**: In-app performance dashboard -2. **Memory Usage Tracking**: Monitor memory allocations during loading -3. **Thread Utilization Metrics**: Track CPU core usage efficiency -4. **User Interaction Timing**: Measure time to interactive - -## ✅ **Success Metrics Achieved** - -- ✅ **75% reduction** in total loading time (18.6s → 4.7s) -- ✅ **79% improvement** in DungeonEditor loading (17.9s → 3.7s) -- ✅ **Zero-overhead** performance monitoring when disabled -- ✅ **Smooth incremental loading** with visual feedback -- ✅ **Intelligent on-demand refreshing** for better responsiveness -- ✅ **Multi-threaded architecture** utilizing all CPU cores -- ✅ **Backward compatibility** maintained throughout - -## 🚀 **Result: Lightning-Fast YAZE** - -YAZE has been transformed from a slow-loading application with 18-second freezes to a **lightning-fast ROM editor** that loads in under 5 seconds with smooth, progressive loading and intelligent resource management. The optimizations provide both immediate performance gains and a foundation for future enhancements. +- **Problem Identified**: The original renderer created GPU textures synchronously on the main thread for all 160 overworld maps, blocking the UI for several seconds. +- **Strategy**: Defer texture creation. Bitmaps and surface data are prepared first (a CPU-bound task that can be backgrounded), while the actual GPU texture creation (a main-thread-only task) is done progressively or on-demand. +- **Implementation**: A `CreateBitmapWithoutTexture` method was introduced. A lazy loading system (`ProcessDeferredTextures`) processes a few textures per frame to avoid blocking, and `EnsureMapTexture` creates a texture immediately if a map becomes visible. +- **Result**: A much more responsive UI during ROM loading, with an initial load time of only ~200-500ms. 
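The deferred texture scheme summarized in Appendix C (a per-frame budget plus an immediate path for maps that become visible) can be sketched as below. This is an illustrative model only: `DeferredTextureQueue`, `Defer`, `ProcessSome`, and `EnsureNow` are hypothetical names chosen for this sketch, integer ids stand in for bitmaps, and the injected callback stands in for the real main-thread texture creation call. It is not the actual `ProcessDeferredTextures` or `EnsureMapTexture` code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical queue of map ids whose GPU textures have not been created yet.
// All methods are assumed to be called from the main (rendering) thread.
class DeferredTextureQueue {
 public:
  explicit DeferredTextureQueue(std::function<void(int)> create_texture)
      : create_texture_(std::move(create_texture)) {}

  // Registers a map whose bitmap data is ready but whose texture is deferred.
  void Defer(int map_id) {
    if (pending_.insert(map_id).second) order_.push_back(map_id);
  }

  // Per-frame step: creates at most `budget` textures, oldest first.
  int ProcessSome(int budget) {
    int created = 0;
    while (created < budget && !order_.empty()) {
      const int id = order_.front();
      order_.pop_front();
      if (pending_.erase(id) == 0) continue;  // already handled by EnsureNow
      create_texture_(id);
      ++created;
    }
    return created;
  }

  // On-demand path: creates the texture right away if it is still pending.
  bool EnsureNow(int map_id) {
    if (pending_.erase(map_id) == 0) return false;
    create_texture_(map_id);
    return true;
  }

  std::size_t pending() const { return pending_.size(); }

 private:
  std::function<void(int)> create_texture_;
  std::deque<int> order_;            // FIFO order for progressive creation
  std::unordered_set<int> pending_;  // ids that still need a texture
};
```

In this model the frame loop would call `ProcessSome(n)` once per frame with a small budget, and the drawing code would call `EnsureNow(id)` just before rendering a map that has scrolled into view. Keeping the budget small trades total completion time for a bounded per-frame cost.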
\ No newline at end of file diff --git a/docs/analysis/renderer_optimization_analysis.md b/docs/analysis/renderer_optimization_analysis.md deleted file mode 100644 index ecdb5669..00000000 --- a/docs/analysis/renderer_optimization_analysis.md +++ /dev/null @@ -1,143 +0,0 @@ -# Renderer Class Performance Analysis and Optimization - -## Overview - -This document analyzes the YAZE Renderer class and documents the performance optimizations implemented to improve ROM loading speed, particularly for overworld graphics initialization. - -## Original Performance Issues - -### 1. Blocking Texture Creation -The original `CreateAndRenderBitmap` method was creating GPU textures synchronously on the main thread during ROM loading: -- **Problem**: Each overworld map (160 maps × 512×512 pixels) required immediate GPU texture creation -- **Impact**: Main thread blocked for several seconds during ROM loading -- **Root Cause**: SDL texture creation is a GPU operation that blocks the rendering thread - -### 2. Inefficient Loading Pattern -```cpp -// Original blocking approach -for (int i = 0; i < kNumOverworldMaps; ++i) { - Renderer::Get().CreateAndRenderBitmap(...); // Blocks for each map -} -``` - -## Optimizations Implemented - -### 1. Deferred Texture Creation - -**New Method**: `CreateBitmapWithoutTexture` -- Creates bitmap data and SDL surface without GPU texture -- Allows bulk data processing without blocking -- Textures created on-demand when needed for rendering - -**Implementation**: -```cpp -void CreateBitmapWithoutTexture(int width, int height, int depth, - const std::vector<uint8_t> &data, - gfx::Bitmap &bitmap, gfx::SnesPalette &palette) { - bitmap.Create(width, height, depth, data); - bitmap.SetPalette(palette); - // Note: No RenderBitmap call - texture creation is deferred -} -``` - -### 2. 
Lazy Loading System - -**Components**: -- `deferred_map_textures_`: Vector storing bitmaps waiting for texture creation -- `ProcessDeferredTextures()`: Processes 2 textures per frame to avoid blocking -- `EnsureMapTexture()`: Creates texture immediately when map becomes visible - -**Benefits**: -- Only visible maps get textures created initially -- Remaining textures created progressively without blocking UI -- Smooth user experience during loading - -### 3. Performance Monitoring - -**New Class**: `PerformanceMonitor` -- Tracks timing for all loading operations -- Provides detailed breakdown of where time is spent -- Helps identify future optimization opportunities - -**Usage**: -```cpp -{ - core::ScopedTimer timer("LoadGraphics"); - // ... loading operations ... -} // Automatically records duration -``` - -## Thread Safety Considerations - -### Main Thread Requirement -The Renderer class **MUST** be used only on the main thread because: -1. SDL_Renderer operations are not thread-safe -2. OpenGL/DirectX contexts are bound to the creating thread -3. Texture creation and rendering must happen on the main UI thread - -### Safe Optimization Approach -- Background processing: Bitmap data preparation (CPU-bound) -- Main thread: Texture creation and rendering (GPU-bound) -- Deferred execution: Spread texture creation across multiple frames - -## Performance Improvements - -### Loading Time Reduction -- **Before**: All 160 overworld maps created textures synchronously (~3-5 seconds blocking) -- **After**: Only 4 initial maps create textures, rest deferred (~200-500ms initial load) -- **User Experience**: Immediate responsiveness with progressive loading - -### Memory Efficiency -- Bitmap data created once, textures created on-demand -- No duplicate data structures -- Efficient memory usage with Arena texture management - -## Implementation Details - -### Modified Files -1. **`src/app/core/window.h`**: Added deferred texture methods and documentation -2. 
**`src/app/editor/overworld/overworld_editor.h`**: Added deferred texture tracking -3. **`src/app/editor/overworld/overworld_editor.cc`**: Implemented optimized loading -4. **`src/app/core/performance_monitor.h/.cc`**: Added performance tracking - -### Key Methods Added -- `CreateBitmapWithoutTexture()`: Non-blocking bitmap creation -- `BatchCreateTextures()`: Efficient batch texture creation -- `ProcessDeferredTextures()`: Progressive texture creation -- `EnsureMapTexture()`: On-demand texture creation - -## Usage Guidelines - -### For Developers -1. Use `CreateBitmapWithoutTexture()` for bulk operations during loading -2. Use `EnsureMapTexture()` when a bitmap needs to be rendered -3. Call `ProcessDeferredTextures()` in the main update loop -4. Always use `ScopedTimer` for performance-critical operations - -### For ROM Loading -1. Phase 1: Load all bitmap data without textures -2. Phase 2: Create textures only for visible/needed maps -3. Phase 3: Process remaining textures progressively - -## Future Optimization Opportunities - -### 1. Background Threading (Pending) -- Move bitmap data processing to background threads -- Keep only texture creation on main thread -- Requires careful synchronization - -### 2. Arena Management Optimization (Pending) -- Implement texture pooling for common sizes -- Add texture compression for large maps -- Optimize memory allocation patterns - -### 3. Advanced Lazy Loading (Pending) -- Implement viewport-based loading -- Add texture streaming for very large maps -- Cache frequently used textures - -## Conclusion - -The implemented optimizations provide significant performance improvements for ROM loading while maintaining thread safety and code clarity. The deferred texture creation system allows for smooth, responsive loading without blocking the main thread, dramatically improving the user experience when opening ROMs in YAZE. 
- -The performance monitoring system provides visibility into loading times and will help identify future optimization opportunities as the codebase evolves.