
Pure Rust Single-Threaded Profiling Plan

Issue: mrrc-u33.2
Objective: Comprehensive profiling of pure Rust (mrrc) single-threaded file reading performance to identify bottlenecks for optimization.

Background

Current baseline performance (from criterion.rs):

  • Read 1k records: ~0.94ms (1,062,995 rec/s)
  • Read 10k records: ~9.39ms (1,064,711 rec/s)

This profiling aims to identify bottlenecks within the pure Rust single-threaded implementation to understand where performance is limited and what optimization opportunities exist in this mode.

Profiling Targets

1. Raw File I/O (syscall overhead)

  • Baseline: Current buffered read strategy
  • Question: How much time is spent in read operations vs. processing?
  • Method: perf syscall tracing, strace
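Before reaching for perf, the buffered-read question can be framed in code. The sketch below (hypothetical names, an in-memory slice standing in for a file) counts how often `BufReader` calls into its underlying source; on a real file each such call is one read(2) syscall:

```rust
use std::io::{BufReader, Read};

/// Hypothetical wrapper counting calls into the underlying reader --
/// on a real File, each such call corresponds to one read(2) syscall.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}

/// Drains `data` through a BufReader in 512-byte chunks (pretend each
/// chunk is one record) and reports how often the inner source was hit.
fn count_reads(data: &[u8], capacity: usize) -> usize {
    let mut reader = BufReader::with_capacity(
        capacity,
        CountingReader { inner: data, reads: 0 },
    );
    let mut chunk = [0u8; 512];
    while reader.read(&mut chunk).unwrap() > 0 {}
    reader.get_ref().reads
}

fn main() {
    let data = vec![0u8; 1 << 20]; // 1 MiB of dummy record bytes
    println!("8 KiB buffer:  {} inner reads", count_reads(&data, 8 * 1024));
    println!("64 KiB buffer: {} inner reads", count_reads(&data, 64 * 1024));
}
```

A larger buffer capacity trades memory for fewer trips into the OS, which is exactly the ratio perf syscall tracing will quantify on the real read path.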

2. Record Boundary Detection (leader identification)

  • Current: Scan for 0x1d (record terminator) to find record boundaries
  • Question: Is byte-scanning the bottleneck? Can vectorization help?
  • Method: flamegraph to see time in parsing loop, cachegrind for branch prediction
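The scan in question reduces to finding the next 0x1d byte. A minimal sketch of the naive per-byte version (the `memchr` crate's `memchr::memchr` is the usual vectorized drop-in, not shown here):

```rust
/// Record terminator byte in MARC 21 binary files.
const RECORD_TERMINATOR: u8 = 0x1d;

/// Naive byte-at-a-time scan -- the loop this profiling step examines.
fn find_record_end(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == RECORD_TERMINATOR)
}

fn main() {
    // Two fake "records", each closed by 0x1d.
    let buf = [b'a', b'b', 0x1d, b'c', 0x1d];
    let mut start = 0;
    while let Some(end) = find_record_end(&buf[start..]) {
        println!("record bytes: {:?}", &buf[start..start + end]);
        start += end + 1;
    }
}
```

If the flamegraph shows significant time in this loop, swapping in a SIMD-backed search is the obvious candidate optimization.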

3. MARC Record Parsing (field extraction)

  • Current: nom parser for variable fields
  • Question: Is nom overhead significant? Are there hot loops?
  • Method: flamegraph, perf instruction-level profiling
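For scale, the fixed-width parsing nom performs here is of the kind sketched below: a hand-rolled decode of one 12-byte MARC directory entry (the struct and function names are illustrative, not the mrrc API):

```rust
/// One 12-byte MARC 21 directory entry: 3-byte tag, 4-byte field
/// length, 5-byte starting position (all ASCII digits).
/// Names are illustrative, not the mrrc API.
#[derive(Debug, PartialEq)]
struct DirEntry {
    tag: String,
    len: usize,
    start: usize,
}

fn parse_dir_entry(raw: &[u8; 12]) -> Option<DirEntry> {
    let digits = |s: &[u8]| -> Option<usize> {
        std::str::from_utf8(s).ok()?.parse().ok()
    };
    Some(DirEntry {
        tag: std::str::from_utf8(&raw[0..3]).ok()?.to_string(),
        len: digits(&raw[3..7])?,
        start: digits(&raw[7..12])?,
    })
}

fn main() {
    let entry = parse_dir_entry(b"245004500120").unwrap();
    println!("{entry:?}"); // tag "245", field length 45, start 120
}
```

Comparing instruction-level perf output for the nom path against a baseline like this would show whether combinator overhead is material or lost in the noise.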

4. Memory Allocation Patterns

  • Current: Vec allocations for records and fields
  • Question: Are we allocating too often? Are sizes predictable?
  • Method: heaptrack, cachegrind
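The two allocation patterns heaptrack would distinguish can be sketched as follows (hypothetical helpers; 0x1e is the MARC field terminator):

```rust
/// Allocating pattern: a fresh Vec per field, per record -- the shape
/// heaptrack would flag as an allocation hotspot.
fn split_fields_owned(record: &[u8]) -> Vec<Vec<u8>> {
    record.split(|&b| b == 0x1e).map(|f| f.to_vec()).collect()
}

/// Amortized pattern: borrow slices into a caller-owned Vec that is
/// cleared (not freed) between records.
fn split_fields_borrowed<'a>(record: &'a [u8], out: &mut Vec<&'a [u8]>) {
    out.clear();
    out.extend(record.split(|&b| b == 0x1e));
}

fn main() {
    let record = b"00123leader\x1e245 $aTitle\x1e";
    let mut reuse = Vec::new();
    split_fields_borrowed(record, &mut reuse);
    assert_eq!(split_fields_owned(record).len(), reuse.len());
    println!("{} field slices", reuse.len());
}
```

If heaptrack reports allocation counts proportional to field count, moving toward the reused-buffer shape is a likely recommendation.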

Tools and Methods

Tool          Purpose                                       Command
Criterion.rs  Baseline measurements (already in use)        cargo bench --release
flamegraph    Wall-clock profiling, identify hot functions  cargo flamegraph --bench marc_benchmarks
perf          CPU profiling, cache behavior                 perf record / perf report
cachegrind    Cache efficiency, memory patterns             valgrind --tool=cachegrind
heaptrack     Memory allocation hotspots                    heaptrack_app

Execution Plan

Phase 1: Baseline & Hot Function Identification (15 min)

  1. Run criterion benchmarks with default 10k test set
  2. Generate flamegraph for 10k record read
  3. Identify top 3 time-consuming functions
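As a pre-flight sanity check before the criterion run, a std-only timing over a synthetic buffer can confirm the harness is measuring what we think (no criterion statistics, just `Instant`; counting 0x1d bytes stands in for the real read path):

```rust
use std::time::Instant;

/// Counts record terminators (0x1d) -- a stand-in for record reading.
fn count_records(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == 0x1d).count()
}

fn main() {
    // Synthetic file: 10k fake records of ~25 bytes each.
    let mut record = b"fake-record-payload-here".to_vec();
    record.push(0x1d);
    let data = record.repeat(10_000);

    let start = Instant::now();
    let n = count_records(&data);
    let elapsed = start.elapsed();
    println!("{n} records in {elapsed:?}");
}
```

This is deliberately not a benchmark; criterion's warm-up and outlier handling remain the source of truth for the baseline numbers.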

Phase 2: Detailed Analysis (30 min)

  1. For top function: Run cachegrind to understand cache behavior
  2. For syscalls: Run perf with syscall tracing
  3. For memory: Run heaptrack to find allocation patterns

Phase 3: Bottleneck Hypothesis (10 min)

  • Synthesize findings
  • Generate hypothesis about root cause(s)
  • List potential optimization targets

Success Criteria

✓ Generate flamegraph showing function breakdown
✓ Identify top 3 bottleneck functions by time spent
✓ Quantify cache miss rate for hot functions
✓ Document allocation patterns (count, sizes, frequency)
✓ Produce written analysis with findings and hypotheses
✓ Create actionable recommendations for mrrc-u33.1

Deliverables

All outputs to be stored in docs/design/profiling/:

  1. RUST_SINGLE_THREADED_PROFILING_RESULTS.md
       • Flamegraph analysis (images + interpretation)
       • Cache statistics (L1, L2, L3 hit rates)
       • Syscall breakdown
       • Memory allocation report
       • Summary table of findings
  2. Flamegraph images
       • read_10k_flamegraph.svg (full 10k record read)
       • read_1k_flamegraph.svg (quick profile)
  3. Perf output (raw data)
       • perf_syscalls.txt
       • perf_report.txt
  4. Heaptrack output (raw data)
       • heaptrack.data or summary report
  5. Cachegrind output (raw data)
       • Top functions by cache misses

Notes

  • All benchmarks use --release mode (opt-level=3)
  • Test fixture: 10k_records.mrc (standard, ~2.5MB)
  • Flamegraph uses sampling at 99Hz frequency (default)
  • Cachegrind simulates modern Intel CPU cache behavior
  • Heaptrack captures every allocation (may slow execution)

Next Steps (After Profiling)

Results feed into bottleneck analysis and optimization proposals (see docs/design/OPTIMIZATION_PROPOSAL.md).

Key questions this profiling answers:

  1. Is I/O the bottleneck or parsing?
  2. Can we reduce allocations?
  3. Are there cache-friendly optimizations?
  4. What limits performance in this single-threaded mode?