Overworld::Load Performance Analysis and Optimization Plan
Current Performance Profile
Based on the performance report, Overworld::Load takes 2887.91ms (2.9 seconds), making it the primary bottleneck in ROM loading.
Detailed Analysis of Overworld::Load
Current Implementation Breakdown
absl::Status Overworld::Load(Rom* rom) {
  // 1. Tile Assembly (CPU-bound)
  RETURN_IF_ERROR(AssembleMap32Tiles());  // ~200-400ms
  RETURN_IF_ERROR(AssembleMap16Tiles());  // ~100-200ms

  // 2. Decompression (CPU-bound, memory-intensive)
  DecompressAllMapTiles();  // ~1500-2000ms (MAJOR BOTTLENECK)

  // 3. Map Object Creation (fast)
  for (int map_index = 0; map_index < kNumOverworldMaps; ++map_index) {
    overworld_maps_.emplace_back(map_index, rom_);
  }

  // 4. Map Parent Assignment (fast)
  for (int map_index = 0; map_index < kNumOverworldMaps; ++map_index) {
    map_parent_[map_index] = overworld_maps_[map_index].parent();
  }

  // 5. Map Size Assignment (fast)
  if (asm_version >= 3) {
    AssignMapSizes(overworld_maps_);
  } else {
    FetchLargeMaps();
  }

  // 6. Data Loading (moderate)
  LoadTileTypes();                       // ~50-100ms
  RETURN_IF_ERROR(LoadEntrances());      // ~100-200ms
  RETURN_IF_ERROR(LoadHoles());          // ~50ms
  RETURN_IF_ERROR(LoadExits());          // ~100-200ms
  RETURN_IF_ERROR(LoadItems());          // ~100-200ms
  RETURN_IF_ERROR(LoadOverworldMaps());  // ~200-500ms (already parallelized)
  RETURN_IF_ERROR(LoadSprites());        // ~200-400ms

  return absl::OkStatus();
}
Major Bottlenecks Identified
1. DecompressAllMapTiles() - PRIMARY BOTTLENECK (~1.5-2.0 seconds)
Current Implementation Issues:
- Sequential processing of 160 overworld maps
- Each map calls HyruleMagicDecompress() twice (high/low pointers)
- 320 decompression operations in total
- Each decompression runs a complex algorithm with nested loops
Performance Impact:
for (int i = 0; i < kNumOverworldMaps; i++) {  // 160 iterations
  // Two expensive decompression calls per map
  auto bytes = gfx::HyruleMagicDecompress(rom()->data() + p2, &size1, 1);   // ~5-10ms each
  auto bytes2 = gfx::HyruleMagicDecompress(rom()->data() + p1, &size2, 1);  // ~5-10ms each
  OrganizeMapTiles(bytes, bytes2, i, sx, sy, ttpos);                        // ~2-5ms each
}
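As a sanity check, the per-call estimates in the loop above can be summed across all 160 maps. The following is only an arithmetic sketch; `CostRangeMs`, `PerMapCost`, and `TotalCost` are hypothetical names introduced for illustration and are not part of the codebase.

```cpp
#include <cassert>

// A low/high estimate in milliseconds (hypothetical helper type).
struct CostRangeMs {
  int low;
  int high;
};

// Cost of one loop iteration: two decompression calls plus one
// OrganizeMapTiles call.
inline CostRangeMs PerMapCost(CostRangeMs decompress, CostRangeMs organize) {
  return {2 * decompress.low + organize.low,
          2 * decompress.high + organize.high};
}

// Scales a per-map estimate to the whole overworld.
inline CostRangeMs TotalCost(CostRangeMs per_map, int map_count) {
  assert(map_count > 0);
  return {per_map.low * map_count, per_map.high * map_count};
}
```

With the quoted figures (5-10ms per decompression, 2-5ms for OrganizeMapTiles), this gives 12-25ms per map and 1920-4000ms for 160 maps. Only the low end of that range agrees with the ~1.5-2.0 second profile, which suggests the per-call figures are closer to upper bounds than to averages.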
2. AssembleMap32Tiles() - SECONDARY BOTTLENECK (~200-400ms)
Current Implementation Issues:
- Sequential processing of tile32 data
- Multiple ROM reads per tile
- Complex tile assembly logic
3. AssembleMap16Tiles() - MODERATE BOTTLENECK (~100-200ms)
Current Implementation Issues:
- Sequential processing of tile16 data
- Multiple ROM reads per tile
- Tile info processing
Optimization Strategies
1. Parallelize Decompression Operations
Strategy: Process multiple maps concurrently during decompression
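absl::Status Overworld::DecompressAllMapTilesParallel() {
  // hardware_concurrency() is a runtime value and may return 0 when unknown,
  // so it cannot be constexpr and needs a lower bound of 1.
  const int max_concurrency =
      std::max(1, static_cast<int>(std::thread::hardware_concurrency()));
  // Round up so the last maps are not skipped when the map count does not
  // divide evenly by the number of batches.
  const int maps_per_batch =
      (kNumOverworldMaps + max_concurrency - 1) / max_concurrency;

  std::vector<std::future<void>> futures;
  for (int batch = 0; batch < max_concurrency; ++batch) {
    auto task = [this, batch, maps_per_batch]() {
      const int start = batch * maps_per_batch;
      const int end = std::min(start + maps_per_batch, kNumOverworldMaps);
      for (int i = start; i < end; ++i) {
        // Process map i decompression
        ProcessMapDecompression(i);
      }
    };
    futures.emplace_back(std::async(std::launch::async, task));
  }

  // Wait for all batches to complete; get() also rethrows task exceptions
  for (auto& future : futures) {
    future.get();
  }
  return absl::OkStatus();
}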
Expected Improvement: 60-80% reduction in decompression time (2.0s → 0.4-0.8s)
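Two details decide whether batching like this is correct: every map index must land in exactly one batch, and each task must write only to its own output slots. The sketch below shows one way to satisfy both using only the standard library. `SplitRange` and `ParallelFor` are hypothetical helpers, not existing project functions, and the sketch assumes the per-map work does not touch shared mutable state.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Splits [0, total) into contiguous [start, end) ranges, at most `workers`
// of them, whose sizes differ by at most one. Empty ranges are omitted.
std::vector<std::pair<int, int>> SplitRange(int total, int workers) {
  assert(total >= 0 && workers > 0);
  std::vector<std::pair<int, int>> ranges;
  const int base = total / workers;
  const int extra = total % workers;  // the first `extra` ranges get one more
  int start = 0;
  for (int w = 0; w < workers; ++w) {
    const int size = base + (w < extra ? 1 : 0);
    if (size == 0) continue;
    ranges.emplace_back(start, start + size);
    start += size;
  }
  return ranges;
}

// Calls process(i) once for every i in [0, total), one async task per range.
// `process` must be safe to call concurrently for different indices.
template <typename Fn>
void ParallelFor(int total, int workers, Fn process) {
  std::vector<std::future<void>> futures;
  for (const auto& range : SplitRange(total, workers)) {
    futures.emplace_back(std::async(std::launch::async, [range, &process] {
      for (int i = range.first; i < range.second; ++i) process(i);
    }));
  }
  for (auto& future : futures) future.get();  // rethrows task exceptions
}
```

For 160 maps and 6 workers this yields four ranges of 27 maps and two of 26, so no map is dropped. A caller would pre-size the output container and have `process(i)` write only to element `i`, which avoids locking.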
2. Optimize ROM Access Patterns
Strategy: Batch ROM reads and cache frequently accessed data
// Cache ROM data in memory to reduce I/O overhead
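class RomDataCache {
 public:
  explicit RomDataCache(const Rom* rom) : rom_(rom) {}

  const std::vector<uint8_t>& GetData(uint32_t offset, size_t size) {
    // Key on both offset and size so that a longer read at the same offset
    // is not served from a shorter cached entry.
    const auto key = std::make_pair(offset, size);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, rom_->ReadBytes(offset, size)).first;
    }
    return it->second;
  }

 private:
  std::map<std::pair<uint32_t, size_t>, std::vector<uint8_t>> cache_;
  const Rom* rom_;
};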
Expected Improvement: 10-20% reduction in ROM access time
3. Implement Lazy Map Loading
Strategy: Only load maps that are immediately needed
absl::Status Overworld::LoadEssentialMaps() {
  // Only load first few maps initially
  constexpr int kInitialMapCount = 8;

  RETURN_IF_ERROR(AssembleMap32Tiles());
  RETURN_IF_ERROR(AssembleMap16Tiles());

  // Load only essential maps
  DecompressEssentialMaps(kInitialMapCount);

  // Load remaining maps in background
  StartBackgroundMapLoading();

  return absl::OkStatus();
}
Expected Improvement: 70-80% reduction in initial loading time (2.9s → 0.6-0.9s)
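Lazy loading also needs a way to build a map the first time it is requested, in case the user reaches it before any background loading does. A minimal single-threaded sketch of that bookkeeping follows. `LazyMapLoader` and its callback are hypothetical and stand in for whatever per-map work the loader would actually perform; combining it with a background thread would additionally require synchronization that is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: tracks which maps are built and builds the rest on
// first access. Not thread-safe.
class LazyMapLoader {
 public:
  using LoadFn = std::function<void(int map_index)>;

  LazyMapLoader(int map_count, LoadFn load)
      : loaded_(map_count, false), load_(std::move(load)) {}

  // Eagerly builds the first `count` maps (clamped to the map count).
  void LoadInitial(int count) {
    const int limit = std::min(count, static_cast<int>(loaded_.size()));
    for (int i = 0; i < limit; ++i) EnsureLoaded(i);
  }

  // Builds `map_index` if it has not been built yet.
  // Returns true when work was done, false when the map was already loaded.
  bool EnsureLoaded(int map_index) {
    assert(map_index >= 0 && map_index < static_cast<int>(loaded_.size()));
    if (loaded_[map_index]) return false;
    load_(map_index);
    loaded_[map_index] = true;
    return true;
  }

  int loaded_count() const {
    return static_cast<int>(std::count(loaded_.begin(), loaded_.end(), true));
  }

 private:
  std::vector<bool> loaded_;
  LoadFn load_;
};
```

The editor would call `EnsureLoaded(i)` before reading map `i`, so a map is never used before it has been built, regardless of load order.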
4. Optimize HyruleMagicDecompress
Strategy: Profile and optimize the decompression algorithm
Current Algorithm Complexity:
- Nested loops with O(n²) complexity in worst case
- Multiple memory allocations and reallocations
- String matching operations
Potential Optimizations:
- Pre-allocate buffers to avoid reallocations
- Optimize string matching with better algorithms
- Use SIMD instructions for bulk operations
- Cache decompression results for identical data
Expected Improvement: 20-40% reduction in decompression time
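Of the optimizations listed, caching results for identical data is the simplest to prototype. The sketch below memoizes output by source offset, assuming the ROM buffer is not modified while the cache is alive. `DecompressionCache` and its callback type are hypothetical, and whether any overworld maps actually share compressed data is not stated here and would need to be measured first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical memoizing wrapper around a decompression routine.
class DecompressionCache {
 public:
  using Bytes = std::vector<uint8_t>;
  using DecompressFn = std::function<Bytes(uint32_t rom_offset)>;

  explicit DecompressionCache(DecompressFn decompress)
      : decompress_(std::move(decompress)) {
    assert(decompress_ && "a decompression callback is required");
  }

  // Returns the decompressed bytes for `rom_offset`, running the callback
  // only the first time that offset is requested.
  const Bytes& Get(uint32_t rom_offset) {
    auto it = cache_.find(rom_offset);
    if (it == cache_.end()) {
      it = cache_.emplace(rom_offset, decompress_(rom_offset)).first;
    }
    return it->second;
  }

  std::size_t size() const { return cache_.size(); }

 private:
  DecompressFn decompress_;
  std::unordered_map<uint32_t, Bytes> cache_;
};
```

References returned by `Get` stay valid as more entries are added, because `std::unordered_map` does not move its elements on rehash.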
5. Memory Pool Optimization
Strategy: Use memory pools for frequent allocations
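class DecompressionMemoryPool {
 public:
  explicit DecompressionMemoryPool(size_t buffer_size)
      : buffer_size_(buffer_size) {}

  // Returns a buffer of at least `size` bytes. Not thread-safe.
  uint8_t* AllocateBuffer(size_t size) {
    if (size <= buffer_size_ && !free_buffers_.empty()) {
      // Reuse a previously released buffer
      uint8_t* buffer = free_buffers_.back();
      free_buffers_.pop_back();
      return buffer;
    }
    // Allocate a new buffer; the pool keeps ownership
    buffers_.push_back(
        std::make_unique<uint8_t[]>(std::max(size, buffer_size_)));
    return buffers_.back().get();
  }

  // Returns a buffer obtained from AllocateBuffer() to the pool
  void ReleaseBuffer(uint8_t* buffer) { free_buffers_.push_back(buffer); }

 private:
  std::vector<std::unique_ptr<uint8_t[]>> buffers_;  // owns every buffer
  std::vector<uint8_t*> free_buffers_;               // buffers ready for reuse
  size_t buffer_size_;
};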
Implementation Priority
Phase 1: High Impact, Low Risk (Immediate)
- Parallelize DecompressAllMapTiles - Biggest performance gain
- Implement lazy loading for non-essential maps
- Add performance monitoring to identify remaining bottlenecks
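For the performance monitoring item, a scope-based timer is usually enough to attribute time to each stage of Overworld::Load. The sketch below uses only `std::chrono`. `TimingReport` and `ScopedTimer` are hypothetical names used for illustration and may differ from whatever produced the existing performance report.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical collector: accumulates elapsed time and call counts per label.
class TimingReport {
 public:
  void Add(const std::string& label, double elapsed_ms) {
    assert(elapsed_ms >= 0.0);
    Entry& entry = entries_[label];
    entry.total_ms += elapsed_ms;
    entry.calls += 1;
  }

  double TotalMs(const std::string& label) const {
    auto it = entries_.find(label);
    return it == entries_.end() ? 0.0 : it->second.total_ms;
  }

  int CallCount(const std::string& label) const {
    auto it = entries_.find(label);
    return it == entries_.end() ? 0 : it->second.calls;
  }

 private:
  struct Entry {
    double total_ms = 0.0;
    int calls = 0;
  };
  std::map<std::string, Entry> entries_;
};

// Records the lifetime of the enclosing scope under `label`.
class ScopedTimer {
 public:
  ScopedTimer(TimingReport& report, std::string label)
      : report_(report),
        label_(std::move(label)),
        start_(std::chrono::steady_clock::now()) {}

  ~ScopedTimer() {
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start_;
    report_.Add(label_, elapsed.count());
  }

  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  TimingReport& report_;
  std::string label_;
  std::chrono::steady_clock::time_point start_;
};
```

Wrapping each stage in its own block with a `ScopedTimer` would replace the estimated per-stage figures with measured ones. Note that `TimingReport` is not thread-safe, so timers created inside parallel tasks would need a lock or a per-thread report.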
Phase 2: Medium Impact, Medium Risk (Next)
- Optimize ROM access patterns
- Implement memory pooling for decompression
- Profile and optimize HyruleMagicDecompress
Phase 3: Lower Impact, Higher Risk (Future)
- Rewrite decompression algorithm with SIMD
- Implement advanced caching strategies
- Consider alternative data formats for faster loading
Expected Performance Improvements
Conservative Estimates
- Current: 2887ms total loading time
- After Phase 1: 800-1200ms (60-70% improvement)
- After Phase 2: 500-800ms (70-80% improvement)
- After Phase 3: 300-500ms (80-85% improvement)
Aggressive Estimates
- Current: 2887ms total loading time
- After Phase 1: 600-900ms (70-80% improvement)
- After Phase 2: 300-500ms (80-85% improvement)
- After Phase 3: 200-400ms (85-90% improvement)
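The percentages in both lists are rounded figures relative to the 2887ms baseline. For checking them against future measurements, the calculation is sketched below; `ImprovementPercent` is a hypothetical helper name.

```cpp
#include <cassert>

// Percentage reduction in load time relative to a baseline measurement.
double ImprovementPercent(double baseline_ms, double optimized_ms) {
  assert(baseline_ms > 0.0);
  return 100.0 * (baseline_ms - optimized_ms) / baseline_ms;
}
```

For example, the conservative Phase 1 range of 800-1200ms works out to roughly 58-72% against 2887ms, which the list rounds to 60-70%.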
Conclusion
The primary optimization opportunity is DecompressAllMapTiles(), which accounts for an estimated 1.5-2.0 seconds of the 2.9 seconds spent in Overworld::Load. Parallel processing and lazy loading target that cost directly and do not change the data that is loaded.
The optimizations should focus on:
- Parallelization of CPU-bound operations
- Lazy loading of non-essential data
- Memory optimization to reduce allocation overhead
- ROM access optimization to reduce I/O bottlenecks
If the estimates above hold, these changes should reduce ROM loading from about 2.9 seconds to 0.8-1.2 seconds after Phase 1 (conservative estimate), while keeping the same functionality and data integrity.