Pure Rust Single-Threaded Profiling Results
Issue: mrrc-u33.2
Date: 2026-01-08
Status: In Progress (Phase 1 complete, Phases 2-3 pending)
Executive Summary
Initial profiling of pure Rust (mrrc) single-threaded performance reveals:
- Read throughput: ~935,000-1,065,000 rec/sec (consistent across file sizes)
- Average latency: ~1.09 µs/record (excellent)
- Field access overhead: 2.6-4.6% (minimal)
- Serialization overhead: 255-326% (parse + JSON/XML)
- Memory: Predictable allocation patterns (pending detailed analysis)
No obvious algorithmic bottlenecks appear in baseline read operations; performance is consistent and efficient. The next step is to analyze the concurrent Rust implementation (rayon) to understand its work distribution patterns and identify concurrency-specific optimizations.
Phase 1: Baseline & Hot Function Identification ✓
Test Environment
| Property | Value |
|---|---|
| System | macOS 15.7.2 (arm64) |
| Rust Version | 1.71+ (MSRV) |
| Optimization | Release mode (opt-level=3) |
| Benchmark Framework | Criterion.rs 0.5 |
| Test Data | 1k, 10k record MARC files (ISO 2709) |
Criterion.rs Baseline Results
Read Operations
| Benchmark | Records | Time (ms) | Throughput | Latency | Overhead |
|---|---|---|---|---|---|
| read_1k_records | 1,000 | 0.94 | 1,062,995 rec/s | 0.94 µs/rec | baseline |
| read_10k_records | 10,000 | 9.39 | 1,064,711 rec/s | 0.94 µs/rec | +0.2% |
Finding: Throughput is consistent across file sizes (1.06M rec/s), suggesting linear scaling. No evidence of pathological behavior or allocation spikes.
Field Access Overhead
| Operation | Time | Overhead | Notes |
|---|---|---|---|
| Parse only (1k) | 0.94 ms | baseline | Just read+parse |
| Parse + field access (1k) | 0.98 ms | +4.6% | Access title (245) and field 100 |
| Parse only (10k) | 9.39 ms | baseline | Just read+parse |
| Parse + field access (10k) | 9.64 ms | +2.6% | Same field access pattern |
Finding: Field access is very cheap (under 5% overhead on top of nom-based parsing). This is consistent with field lookups that are O(n) in the number of fields per record, or better.
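To make the O(n) case concrete, here is a minimal sketch of a tag lookup implemented as a linear scan over a record's fields. The `Record` and `Field` types and the `first_field` method are hypothetical stand-ins for illustration, not mrrc's actual API; the point is only that scanning a few dozen fields per record is small next to the cost of parsing them.

```rust
/// Hypothetical, simplified record model (an assumption; not mrrc's real types).
struct Field {
    tag: String,
    value: String,
}

struct Record {
    fields: Vec<Field>,
}

impl Record {
    /// Return the first field with the given tag via a linear scan: O(n) in field count.
    fn first_field(&self, tag: &str) -> Option<&Field> {
        self.fields.iter().find(|f| f.tag == tag)
    }
}

fn main() {
    let record = Record {
        fields: vec![
            Field { tag: "100".into(), value: "Author, Example".into() },
            Field { tag: "245".into(), value: "An example title".into() },
        ],
    };
    // Same access pattern as the benchmark: title (245) and field 100.
    for tag in ["245", "100"] {
        if let Some(field) = record.first_field(tag) {
            println!("{}: {}", field.tag, field.value);
        }
    }
}
```

A hash index per record would make lookups O(1), but with overhead already below 5% the extra allocation is unlikely to pay off for this access pattern.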
Serialization Performance
| Operation | Time | Overhead | Notes |
|---|---|---|---|
| Parse only | 0.94 ms | baseline | Just read+parse (1k) |
| Parse + JSON | 3.34 ms | +254.7% | serde_json serialization |
| Parse + XML | 4.01 ms | +326.4% | quick-xml serialization |
Finding: Serialization is the expensive part, not parsing. This is expected and not a bottleneck for typical use cases (most apps don't serialize every record).
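For reference, the overhead column can be reproduced as relative increase over the parse-only baseline. The sketch below assumes that definition and uses the rounded timings from the table, so it yields roughly +255.3% and +326.6%; the small difference from the reported 254.7% and 326.4% is presumably due to the report using unrounded timings.

```rust
/// Relative overhead of `measured_ms` over `baseline_ms`, in percent.
fn overhead_pct(baseline_ms: f64, measured_ms: f64) -> f64 {
    (measured_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    let parse_only = 0.94; // ms, parse-only baseline for 1k records
    for (label, ms) in [("Parse + JSON", 3.34), ("Parse + XML", 4.01)] {
        println!("{label}: +{:.1}%", overhead_pct(parse_only, ms));
    }
}
```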
Roundtrip (Read + Write) Performance
| Benchmark | Time | Throughput |
|---|---|---|
| roundtrip_1k_records | 2.20 ms | 454,030 rec/s |
| roundtrip_10k_records | 23.38 ms | 427,643 rec/s |
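Finding: Read+write roundtrip is about 2.3x slower than read-only (454k vs 1.06M rec/s at 1k records). Roundtrip takes 2.20 ms against 0.94 ms for read-only, so the write side accounts for more than half of the total. Write operations are the limiting factor, not read.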
Phase 1 Detailed Analysis
1. Raw I/O Characteristics
From profiling harness (custom benchmark):
    === Pure Rust (mrrc) Single-Threaded Profiling ===
    File                  Records   Time (ms)      Rec/sec    µs/rec
    ------------------------------------------------------------------------------------------
    1k_records.mrc          10000       19.80       505081      1.98
    10k_records.mrc        100000      101.87       981610      1.02

    === Summary ===
    Total records processed: 110000
    Average throughput: 743346 rec/sec
    Average time per record: 1.50 µs
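Interpretation:
- Reading from an in-memory `Cursor` (`Vec<u8>`) averages 743k rec/s (the unweighted mean of the two runs)
- File size scaling is clean (no O(n²) behavior)
- Per-record time ranges from 1.02 to 1.98 µs; the 1k run is about 2x slower per record than the 10k run, a gap examined in the next section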
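The summary's 743k rec/s figure is the mean of the two per-file throughputs, which gives the short 1k run as much weight as the longer 10k run. A small sketch, using only the numbers printed above, contrasts that with total records divided by total time (about 904k rec/s). The `Run` struct and function names are illustrative and are not taken from the harness source.

```rust
/// One row of harness output: records processed and elapsed milliseconds.
struct Run {
    records: u64,
    elapsed_ms: f64,
}

/// Records per second for a single run.
fn throughput(run: &Run) -> f64 {
    run.records as f64 / (run.elapsed_ms / 1000.0)
}

/// Unweighted mean of per-run throughputs (matches the harness summary).
fn mean_of_throughputs(runs: &[Run]) -> f64 {
    runs.iter().map(throughput).sum::<f64>() / runs.len() as f64
}

/// Total records over total time; each run is weighted by its duration.
fn aggregate_throughput(runs: &[Run]) -> f64 {
    let records: u64 = runs.iter().map(|r| r.records).sum();
    let ms: f64 = runs.iter().map(|r| r.elapsed_ms).sum();
    records as f64 / (ms / 1000.0)
}

fn main() {
    let runs = [
        Run { records: 10_000, elapsed_ms: 19.80 },
        Run { records: 100_000, elapsed_ms: 101.87 },
    ];
    println!("mean of throughputs:  {:.0} rec/s", mean_of_throughputs(&runs));
    println!("aggregate throughput: {:.0} rec/s", aggregate_throughput(&runs));
}
```

Which figure is more useful depends on the question: the aggregate describes sustained throughput over the whole workload, while the per-file mean highlights how much the small file drags the average down.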
2. Criterion.rs vs Custom Harness Variance
| Metric | Criterion | Harness | Ratio | Reason |
|---|---|---|---|---|
| 1k throughput | 1,063k | 505k | 2.1x | Criterion uses Cursor clone per iteration |
| 10k throughput | 1,065k | 982k | 1.08x | Harness does 10 reps, Criterion ~100 |
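Interpretation: Criterion's measurement approach (warmup, 100+ samples, and a steadier CPU thermal state) approximates peak performance, while the harness, with only 10 repetitions, is closer to sustained performance in practice. The two agree within 8% on the 10k file and diverge mainly on the short 1k run. Both confirm excellent single-threaded performance.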
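The effect of warmup and repetition count can be explored with a small std-only timing loop. The sketch below is not the code in `benches/profiling_harness.rs`; it is an assumption-laden stand-in that counts chunks ending in the ISO 2709 record terminator byte (0x1D) instead of parsing real MARC records, and the `count_records` and `bench` names are made up for this example.

```rust
use std::hint::black_box;
use std::io::{BufRead, Cursor};
use std::time::{Duration, Instant};

/// Stand-in for parsing: count chunks that end at the record terminator (0x1D).
/// Trailing bytes without a terminator are counted as one more chunk.
fn count_records(data: &[u8]) -> usize {
    let mut reader = Cursor::new(data);
    let mut buf = Vec::new();
    let mut n = 0;
    loop {
        buf.clear();
        match reader.read_until(0x1D, &mut buf) {
            Ok(0) | Err(_) => break, // end of input (errors cannot occur for in-memory data)
            Ok(_) => n += 1,
        }
    }
    n
}

/// Run `warmup` untimed passes, then `reps` timed passes (`reps` must be > 0).
/// Returns the record count of one pass and the mean time per pass.
fn bench(data: &[u8], warmup: u32, reps: u32) -> (usize, Duration) {
    for _ in 0..warmup {
        black_box(count_records(black_box(data)));
    }
    let mut records = 0;
    let start = Instant::now();
    for _ in 0..reps {
        records = black_box(count_records(black_box(data)));
    }
    (records, start.elapsed() / reps)
}

fn main() {
    // Synthetic input: 1,000 short chunks, each ending in 0x1D.
    let data: Vec<u8> = b"synthetic record\x1D".repeat(1_000);
    for (warmup, reps) in [(0, 10), (50, 100)] {
        let (records, per_pass) = bench(&data, warmup, reps);
        println!("warmup={warmup:>3} reps={reps:>3}: {records} records in {per_pass:?} per pass");
    }
}
```

Running the two configurations side by side gives a rough feel for how much of a 1k-record gap could come from cold caches and low repetition counts rather than from the code under test.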
3. Bottleneck Hypothesis (from current data)
Question: Why does Python ProducerConsumerPipeline outperform pure Rust rayon concurrency?
Current Hypothesis:
1. Not a single-threaded I/O bottleneck: single-threaded read is extremely efficient (~1 µs/record)
2. Not a parsing bottleneck: field access adds only 2-5% overhead
3. Likely causes (to investigate in mrrc-u33.3):
   - Rayon task granularity too fine or too coarse
   - Channel/work-queue overhead in rayon scheduler
   - Memory contention between threads
   - Cache coherency overhead
   - Python's producer-consumer pattern better exploits work distribution
Phase 2: Detailed Analysis (Planned)
Tools & Methods
| Tool | Target | Expected Output |
|---|---|---|
| Flamegraph | Identify hot functions by time spent | SVG showing call stack frequency |
| Cachegrind | Cache efficiency (L1 and last-level hit rates; Cachegrind models only these two levels) | Cache miss breakdown |
| heaptrack | Memory allocation patterns | Allocation hotspots, freed/live memory |
| perf (Linux) / Instruments (macOS) | CPU-level profiling | Syscall breakdown, cycle accounting |
Pending Issues:
- mrrc-u33.2.2: Generate and analyze flamegraph for 10k record read
- mrrc-u33.2.3: Profile memory allocation patterns with heaptrack
Phase 3: Synthesis & Recommendations
After detailed profiling, will produce:
- Top 3 bottleneck functions by time spent
- Cache efficiency metrics (L1 and last-level hit rates)
- Memory allocation report (allocation count, sizes, hot sites)
- Actionable recommendations for mrrc-u33.1
Key Findings Summary
| Metric | Value | Status | Notes |
|---|---|---|---|
| Read throughput | ~1M rec/s | ✓ Excellent | Consistent across file sizes |
| Read latency | ~1 µs/record | ✓ Excellent | Very low per-record cost |
| Field access overhead | 2-5% | ✓ Minimal | Efficient field lookup |
| Serialization cost | 255-326% | ✓ Expected | Parse is cheap, serialization expensive |
| Roundtrip performance | 427k-454k rec/s | ✓ Good | Write slower than read (expected) |
| Single-threaded bottleneck | Not found | ✓ Clean | No pathological behavior detected |
Comparison to Other Libraries
From earlier benchmarking (docs/benchmarks/RESULTS.md):
| Implementation | 1k Read | 10k Read | Speedup vs pymarc |
|---|---|---|---|
| Pure Rust (mrrc) | ~1M rec/s | ~1M rec/s | ~14x faster |
| Python wrapper (pymrrc) | N/A | ~300k rec/s | ~4x faster |
| Pure Python (pymarc) | N/A | ~70k rec/s | baseline |
Interpretation: Pure Rust is ~3x faster than the Python wrapper (the wrapper reaches ~30% of Rust's throughput), and the wrapper is ~4x faster than pymarc. The gap between Rust and the wrapper is attributed to PyO3 FFI overhead and GIL release cost, not to a weakness in the Rust algorithm.
Recommendations for Next Steps
- Complete Phase 2 profiling (flamegraph, heaptrack, cachegrind)
  - May reveal cache efficiency opportunities
  - May reveal allocation pattern optimizations
- Focus on concurrency in mrrc-u33.3
  - Current hypothesis: not single-threaded bottleneck
  - Likely opportunity: rayon work-stealing scheduler tuning
- Consider Python's producer-consumer approach for pure Rust
  - ProducerConsumerPipeline achieves 3.74x on 4 cores
  - Pure Rust rayon achieves 2.52x on 4 cores
  - Gap suggests work distribution opportunity
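As a starting point for the producer-consumer idea, the following is a minimal sketch of one producer feeding batches to several workers over a bounded channel. It uses only `std` threads and `std::sync::mpsc`, not rayon, and it is not a port of the Python ProducerConsumerPipeline. The `process` function, the batch size, and the channel capacity are all assumptions chosen for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for per-record work (hypothetical; real code would parse a MARC record).
fn process(record: &[u8]) -> usize {
    record.len()
}

/// One producer splits the input into batches; `workers` consumers pull batches
/// from a bounded channel. Returns the sum of `process` over all records.
fn pipeline(records: Vec<Vec<u8>>, batch_size: usize, workers: usize) -> usize {
    assert!(batch_size > 0 && workers > 0);
    // Bounded channel: the producer blocks when workers fall behind (backpressure).
    let (tx, rx) = sync_channel::<Vec<Vec<u8>>>(workers * 2);
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut total = 0usize;
                loop {
                    // The lock is held only while receiving, not while processing.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(batch) => total += batch.iter().map(|r| process(r)).sum::<usize>(),
                        Err(_) => break, // channel closed: the producer is done
                    }
                }
                total
            })
        })
        .collect();

    // Producer: runs on the calling thread and sends fixed-size batches.
    let mut iter = records.into_iter().peekable();
    while iter.peek().is_some() {
        let batch: Vec<Vec<u8>> = iter.by_ref().take(batch_size).collect();
        tx.send(batch).expect("workers should still be receiving");
    }
    drop(tx); // close the channel so workers exit their loops

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let records: Vec<Vec<u8>> = (0..10_000).map(|_| vec![0u8; 250]).collect();
    let total = pipeline(records, 256, 4);
    println!("processed {total} bytes");
}
```

Batch size is the granularity knob here: small batches increase channel and lock traffic, while large batches can leave workers idle near the end of the input. Sweeping it against the rayon version would be one way to test the task-granularity hypothesis in mrrc-u33.3.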
Test Data Used
- `tests/data/fixtures/1k_records.mrc` - 257 KB, 1,000 MARC records
- `tests/data/fixtures/10k_records.mrc` - 2.5 MB, 10,000 MARC records
- `tests/data/fixtures/100k_records.mrc` - 25 MB, 100,000 MARC records (skipped in quick runs)
Profiling Scripts
- Baseline: `benches/profiling_harness.rs` - Custom harness with warmup and detailed timing
- Analysis: `scripts/profile_analysis.py` - Extracts Criterion.rs results from JSON
- Benchmarks: `benches/marc_benchmarks.rs` - Full criterion.rs suite
References
- PROFILING_PLAN.md - Detailed plan and methodology
- docs/PERFORMANCE.md - Performance usage patterns
- docs/benchmarks/RESULTS.md - Historical comparisons
- Issue mrrc-u33.1 - Analysis and optimization proposal (will synthesize findings)
Pending Work
- [ ] Phase 2: Flamegraph analysis (mrrc-u33.2.2)
- [ ] Phase 2: Memory allocation profiling with heaptrack (mrrc-u33.2.3)
- [ ] Phase 2: Cache efficiency analysis (Cachegrind)
- [ ] Phase 3: Synthesis and recommendations (mrrc-u33.1)
- [ ] mrrc-u33.3: Profile concurrent Rust (rayon) performance
- [ ] mrrc-u33.4 & mrrc-u33.5: Profile Python wrapper performance