Files
yaze/docs/analysis/overworld_load_optimization_analysis.md
scawful 91a6a49d1a Add comprehensive analysis of ZScream vs YAZE overworld implementations
- Introduced a detailed comparison document highlighting the functional equivalence between ZScream (C#) and YAZE (C++) overworld loading logic.
- Verified key areas such as tile loading, expansion detection, map decompression, and coordinate calculations, confirming consistent behavior across both implementations.
- Documented differences and improvements in YAZE, including enhanced error handling and memory management.
- Provided validation results from integration tests ensuring data integrity and compatibility with existing ROMs.
2025-09-28 22:49:29 -04:00

7.7 KiB

Overworld::Load Performance Analysis and Optimization Plan

Current Performance Profile

Based on the performance report, Overworld::Load takes 2887.91ms (2.9 seconds), making it the primary bottleneck in ROM loading.

Detailed Analysis of Overworld::Load

Current Implementation Breakdown

absl::Status Overworld::Load(Rom* rom) {
  // 1. Tile Assembly (CPU-bound)
  RETURN_IF_ERROR(AssembleMap32Tiles());     // ~200-400ms
  RETURN_IF_ERROR(AssembleMap16Tiles());     // ~100-200ms
  
  // 2. Decompression (CPU-bound, memory-intensive)
  DecompressAllMapTiles();                   // ~1500-2000ms (MAJOR BOTTLENECK)
  
  // 3. Map Object Creation (fast)
  for (int map_index = 0; map_index < kNumOverworldMaps; ++map_index)
    overworld_maps_.emplace_back(map_index, rom_);
  
  // 4. Map Parent Assignment (fast)
  for (int map_index = 0; map_index < kNumOverworldMaps; ++map_index) {
    map_parent_[map_index] = overworld_maps_[map_index].parent();
  }
  
  // 5. Map Size Assignment (fast)
  if (asm_version >= 3) {
    AssignMapSizes(overworld_maps_);
  } else {
    FetchLargeMaps();
  }
  
  // 6. Data Loading (moderate)
  LoadTileTypes();                           // ~50-100ms
  RETURN_IF_ERROR(LoadEntrances());          // ~100-200ms
  RETURN_IF_ERROR(LoadHoles());              // ~50ms
  RETURN_IF_ERROR(LoadExits());              // ~100-200ms
  RETURN_IF_ERROR(LoadItems());              // ~100-200ms
  RETURN_IF_ERROR(LoadOverworldMaps());      // ~200-500ms (already parallelized)
  RETURN_IF_ERROR(LoadSprites());            // ~200-400ms
}

Major Bottlenecks Identified

1. DecompressAllMapTiles() - PRIMARY BOTTLENECK (~1.5-2.0 seconds)

Current Implementation Issues:

  • Sequential processing of 160 overworld maps
  • Each map calls HyruleMagicDecompress() twice (high/low pointers)
  • 320 decompression operations total
  • Each decompression involves complex algorithm with nested loops

Performance Impact:

for (int i = 0; i < kNumOverworldMaps; i++) {  // 160 iterations
  // Two expensive decompression calls per map
  auto bytes = gfx::HyruleMagicDecompress(rom()->data() + p2, &size1, 1);   // ~5-10ms each
  auto bytes2 = gfx::HyruleMagicDecompress(rom()->data() + p1, &size2, 1);  // ~5-10ms each
  OrganizeMapTiles(bytes, bytes2, i, sx, sy, ttpos);  // ~2-5ms each
}

2. AssembleMap32Tiles() - SECONDARY BOTTLENECK (~200-400ms)

Current Implementation Issues:

  • Sequential processing of tile32 data
  • Multiple ROM reads per tile
  • Complex tile assembly logic

3. AssembleMap16Tiles() - MODERATE BOTTLENECK (~100-200ms)

Current Implementation Issues:

  • Sequential processing of tile16 data
  • Multiple ROM reads per tile
  • Tile info processing

Optimization Strategies

1. Parallelize Decompression Operations

Strategy: Process multiple maps concurrently during decompression

absl::Status DecompressAllMapTilesParallel() {
  constexpr int kMaxConcurrency = std::thread::hardware_concurrency();
  constexpr int kMapsPerBatch = kNumOverworldMaps / kMaxConcurrency;
  
  std::vector<std::future<void>> futures;
  
  for (int batch = 0; batch < kMaxConcurrency; ++batch) {
    auto task = [this, batch, kMapsPerBatch]() {
      int start = batch * kMapsPerBatch;
      int end = std::min(start + kMapsPerBatch, kNumOverworldMaps);
      
      for (int i = start; i < end; ++i) {
        // Process map i decompression
        ProcessMapDecompression(i);
      }
    };
    futures.emplace_back(std::async(std::launch::async, task));
  }
  
  // Wait for all batches to complete
  for (auto& future : futures) {
    future.wait();
  }
  
  return absl::OkStatus();
}

Expected Improvement: 60-80% reduction in decompression time (2.0s → 0.4-0.8s)

2. Optimize ROM Access Patterns

Strategy: Batch ROM reads and cache frequently accessed data

// Cache ROM data in memory to reduce I/O overhead
class RomDataCache {
 private:
  std::unordered_map<uint32_t, std::vector<uint8_t>> cache_;
  const Rom* rom_;
  
 public:
  const std::vector<uint8_t>& GetData(uint32_t offset, size_t size) {
    auto it = cache_.find(offset);
    if (it == cache_.end()) {
      auto data = rom_->ReadBytes(offset, size);
      cache_[offset] = std::move(data);
      return cache_[offset];
    }
    return it->second;
  }
};

Expected Improvement: 10-20% reduction in ROM access time

3. Implement Lazy Map Loading

Strategy: Only load maps that are immediately needed

absl::Status Overworld::LoadEssentialMaps() {
  // Only load first few maps initially
  constexpr int kInitialMapCount = 8;
  
  RETURN_IF_ERROR(AssembleMap32Tiles());
  RETURN_IF_ERROR(AssembleMap16Tiles());
  
  // Load only essential maps
  DecompressEssentialMaps(kInitialMapCount);
  
  // Load remaining maps in background
  StartBackgroundMapLoading();
  
  return absl::OkStatus();
}

Expected Improvement: 70-80% reduction in initial loading time (2.9s → 0.6-0.9s)

4. Optimize HyruleMagicDecompress

Strategy: Profile and optimize the decompression algorithm

Current Algorithm Complexity:

  • Nested loops with O(n²) complexity in worst case
  • Multiple memory allocations and reallocations
  • String matching operations

Potential Optimizations:

  • Pre-allocate buffers to avoid reallocations
  • Optimize string matching with better algorithms
  • Use SIMD instructions for bulk operations
  • Cache decompression results for identical data

Expected Improvement: 20-40% reduction in decompression time

5. Memory Pool Optimization

Strategy: Use memory pools for frequent allocations

class DecompressionMemoryPool {
 private:
  std::vector<std::unique_ptr<uint8_t[]>> buffers_;
  size_t buffer_size_;
  
 public:
  uint8_t* AllocateBuffer(size_t size) {
    // Reuse existing buffers or allocate new ones
    if (size <= buffer_size_) {
      // Return existing buffer
    } else {
      // Allocate new buffer
    }
  }
  
  void ReleaseBuffer(uint8_t* buffer) {
    // Return buffer to pool
  }
};

Implementation Priority

Phase 1: High Impact, Low Risk (Immediate)

  1. Parallelize DecompressAllMapTiles - Biggest performance gain
  2. Implement lazy loading for non-essential maps
  3. Add performance monitoring to identify remaining bottlenecks

Phase 2: Medium Impact, Medium Risk (Next)

  1. Optimize ROM access patterns
  2. Implement memory pooling for decompression
  3. Profile and optimize HyruleMagicDecompress

Phase 3: Lower Impact, Higher Risk (Future)

  1. Rewrite decompression algorithm with SIMD
  2. Implement advanced caching strategies
  3. Consider alternative data formats for faster loading

Expected Performance Improvements

Conservative Estimates

  • Current: 2887ms total loading time
  • After Phase 1: 800-1200ms (60-70% improvement)
  • After Phase 2: 500-800ms (70-80% improvement)
  • After Phase 3: 300-500ms (80-85% improvement)

Aggressive Estimates

  • Current: 2887ms total loading time
  • After Phase 1: 600-900ms (70-80% improvement)
  • After Phase 2: 300-500ms (80-85% improvement)
  • After Phase 3: 200-400ms (85-90% improvement)

Conclusion

The primary optimization opportunity is in DecompressAllMapTiles(), which represents the majority of the loading time. By implementing parallel processing and lazy loading, we can achieve significant performance improvements while maintaining code reliability.

The optimizations should focus on:

  1. Parallelization of CPU-bound operations
  2. Lazy loading of non-essential data
  3. Memory optimization to reduce allocation overhead
  4. ROM access optimization to reduce I/O bottlenecks

These changes will dramatically improve the user experience during ROM loading while maintaining the same functionality and data integrity.