# Pure Rust Single-Threaded Profiling Plan
Issue: mrrc-u33.2
Objective: Comprehensive profiling of pure Rust (mrrc) single-threaded file reading performance to identify bottlenecks for optimization.
## Background
Current baseline performance (from criterion.rs):

- Read 1k records: ~0.94 ms (1,062,995 rec/s)
- Read 10k records: ~9.39 ms (1,064,711 rec/s)
This profiling aims to identify bottlenecks within the pure Rust single-threaded implementation to understand where performance is limited and what optimization opportunities exist in this mode.
## Profiling Targets
### 1. Raw File I/O (syscall overhead)
- Baseline: Current buffered read strategy
- Question: How much time is spent in read operations vs. processing?
- Method: perf syscall tracing, strace
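To make the read-vs-process split easier to reason about, here is a minimal sketch (a hypothetical helper, not the mrrc implementation) of buffered reading: a large `BufReader` capacity amortizes syscall overhead, so `perf` syscall tracing or `strace` should show few, large `read(2)` calls rather than many small ones.

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Hypothetical sketch: drain a file through a BufReader. Each fill of the
// internal buffer is roughly one read(2) syscall, so a 64 KiB capacity
// means ~40 syscalls for a 2.5 MB fixture instead of hundreds.
fn read_all_buffered(path: &str, buf_capacity: usize) -> std::io::Result<usize> {
    let file = File::open(path)?;
    let mut reader = BufReader::with_capacity(buf_capacity, file);
    let mut chunk = [0u8; 4096];
    let mut total = 0;
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break; // EOF
        }
        total += n;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    // Write a small temp file so the example is self-contained.
    let path = std::env::temp_dir().join("io_sketch.bin");
    std::fs::write(&path, vec![0u8; 10_000])?;
    let total = read_all_buffered(path.to_str().unwrap(), 64 * 1024)?;
    println!("{}", total);
    Ok(())
}
```

Running this under `strace -e trace=read` is a quick way to confirm how buffer capacity changes the syscall count.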
### 2. Record Boundary Detection (leader identification)
- Current: Scan for `0x1d` (record terminator) to find record boundaries
- Question: Is byte-scanning the bottleneck? Can vectorization help?
- Method: flamegraph to see time in parsing loop, cachegrind for branch prediction
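For reference, a naive boundary scan looks like the sketch below (the function name is hypothetical, not the mrrc implementation). The iterator `position` compiles to a plain byte loop; if the flamegraph shows this pattern hot, an explicitly SIMD-accelerated scan (e.g. the `memchr` crate) is the usual vectorization answer.

```rust
/// MARC21 record terminator byte.
const RECORD_TERMINATOR: u8 = 0x1D;

// Hypothetical sketch: return the index just past the next record
// terminator, i.e. the start of the following record.
fn find_record_end(buf: &[u8], from: usize) -> Option<usize> {
    buf[from..]
        .iter()
        .position(|&b| b == RECORD_TERMINATOR)
        .map(|i| from + i + 1)
}

fn main() {
    let data = [b'a', b'b', 0x1D, b'c', 0x1D];
    assert_eq!(find_record_end(&data, 0), Some(3));
    assert_eq!(find_record_end(&data, 3), Some(5));
    assert_eq!(find_record_end(&data, 5), None);
    println!("ok");
}
```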
### 3. MARC Record Parsing (field extraction)
- Current: nom parser for variable fields
- Question: Is nom overhead significant? Are there hot loops?
- Method: flamegraph, perf instruction-level profiling
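To illustrate the per-field work being profiled (this hand-rolled split is a stand-in for what the nom parser does, not its actual code): each variable field's payload is divided at subfield delimiters into (code, value) pairs, and each owned value is an allocation that the flamegraph and heaptrack runs should surface.

```rust
/// MARC21 subfield delimiter byte.
const SUBFIELD_DELIMITER: u8 = 0x1F;

// Hypothetical illustration: split a variable field's payload into
// (subfield code, value) pairs. Each `to_vec` allocates; if parsing
// dominates the profile, borrowing slices instead is one lever.
fn split_subfields(field: &[u8]) -> Vec<(u8, Vec<u8>)> {
    field
        .split(|&b| b == SUBFIELD_DELIMITER)
        .filter(|chunk| !chunk.is_empty())
        .map(|chunk| (chunk[0], chunk[1..].to_vec()))
        .collect()
}

fn main() {
    // A "\x1Fa...\x1Fb..." style payload.
    let field = b"\x1FaHello\x1Fbworld";
    let subs = split_subfields(field);
    assert_eq!(subs.len(), 2);
    assert_eq!(subs[0], (b'a', b"Hello".to_vec()));
    assert_eq!(subs[1], (b'b', b"world".to_vec()));
    println!("ok");
}
```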
### 4. Memory Allocation Patterns
- Current: Vec allocations for records and fields
- Question: Are we allocating too often? Are sizes predictable?
- Method: heaptrack, cachegrind
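One concrete pattern heaptrack can confirm or rule out (a sketch under the assumption that sizes are predictable, not a claim about the current code): pre-reserving `Vec` capacity collapses repeated grow-and-copy reallocations into a single allocation.

```rust
// Hypothetical sketch: when the record count is known up front,
// Vec::with_capacity performs one allocation and the pushes below never
// reallocate - heaptrack would otherwise report many small grow events.
fn collect_lengths_prealloc(records: &[&[u8]]) -> Vec<usize> {
    let mut lengths = Vec::with_capacity(records.len()); // one allocation
    for rec in records {
        lengths.push(rec.len()); // never reallocates
    }
    lengths
}

fn main() {
    let records: Vec<&[u8]> = vec![b"abc", b"de", b"fghij"];
    let lengths = collect_lengths_prealloc(&records);
    assert_eq!(lengths, vec![3, 2, 5]);
    // Capacity stayed exactly at the reserved size: no reallocation.
    assert_eq!(lengths.capacity(), records.len());
    println!("ok");
}
```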
## Tools and Methods
| Tool | Purpose | Command |
|---|---|---|
| Criterion.rs | Baseline measurements (already in use) | `cargo bench --release` |
| flamegraph | Wall-clock profiling, identify hot functions | `cargo flamegraph --bench marc_benchmarks` |
| perf | CPU profiling, cache behavior | `perf record` / `perf report` |
| cachegrind | Cache efficiency, memory patterns | `valgrind --tool=cachegrind` |
| heaptrack | Memory allocation hotspots | `heaptrack <app>` |
## Execution Plan
### Phase 1: Baseline & Hot Function Identification (15 min)
- Run criterion benchmarks with default 10k test set
- Generate flamegraph for 10k record read
- Identify top 3 time-consuming functions
### Phase 2: Detailed Analysis (30 min)
- For top function: Run cachegrind to understand cache behavior
- For syscalls: Run perf with syscall tracing
- For memory: Run heaptrack to find allocation patterns
### Phase 3: Bottleneck Hypothesis (10 min)
- Synthesize findings
- Generate hypothesis about root cause(s)
- List potential optimization targets
## Success Criteria
✓ Generate flamegraph showing function breakdown
✓ Identify top 3 bottleneck functions by time spent
✓ Quantify cache miss rate for hot functions
✓ Document allocation patterns (count, sizes, frequency)
✓ Produce written analysis with findings and hypotheses
✓ Create actionable recommendations for mrrc-u33.1
## Deliverables
All outputs to be stored in docs/design/profiling/:
- `RUST_SINGLE_THREADED_PROFILING_RESULTS.md`
  - Flamegraph analysis (images + interpretation)
  - Cache statistics (L1, L2, L3 hit rates)
  - Syscall breakdown
  - Memory allocation report
  - Summary table of findings
- Flamegraph images
  - `read_10k_flamegraph.svg` (full 10k record read)
  - `read_1k_flamegraph.svg` (quick profile)
- Perf output (raw data)
  - `perf_syscalls.txt`
  - `perf_report.txt`
- Heaptrack output (raw data)
  - `heaptrack.data` or summary report
- Cachegrind output (raw data)
  - Top functions by cache misses
## Notes
- All benchmarks use `--release` mode (opt-level=3)
- Test fixture: `10k_records.mrc` (standard, ~2.5MB)
- Flamegraph uses sampling at 99 Hz frequency (default)
- Cachegrind simulates modern Intel CPU cache behavior
- Heaptrack captures every allocation (may slow execution)
## Next Steps (After Profiling)
Results feed into bottleneck analysis and optimization proposals (see docs/design/OPTIMIZATION_PROPOSAL.md).
Key questions this profiling answers:

1. Is I/O the bottleneck or parsing?
2. Can we reduce allocations?
3. Are there cache-friendly optimizations?
4. What limits performance in this single-threaded mode?