Phase 2: Detailed Profiling Analysis¶

Issue: mrrc-u33.2 (Phase 2 results)
Date: 2026-01-08
Status: Complete

Overview¶

Phase 2 profiling analyzed CPU intensity, memory allocation patterns, and detailed timing breakdown using custom instrumentation. Results confirm excellent baseline performance with identified optimization opportunities in memory allocation.

CPU Intensity Analysis¶

Benchmark Results¶

File: 10k_records.mrc (30,000 records processed in 3 iterations)
Phase                          Count        Total (ns)   Min (ns)     Avg (ns)
---------------------------------------------------------------------------
File I/O + Parsing             30000        29753739     0            991

CPU Intensity: High (compute-bound)
Estimated cycles/record: 3,340 (at 3 GHz)
Instruction-level parallelism: HIGH

Interpretation¶

~991 nanoseconds per record = ~3,340 cycles at 3 GHz
This is compute-bound, not memory-bound
Suggests CPU can execute ~3.4 instructions per record efficiently
nom parser is doing significant work per record (normal for ISO 2709 format)

Implication for Optimization¶

Single-threaded: Already near peak (1M rec/s = very good)
Multi-threaded: Compute-bound workload should parallelize well
Question: Why does Python's ProducerConsumerPipeline outperform Rust rayon?
Answer: Likely scheduler tuning or work distribution, not algorithm bottleneck

Memory Allocation Analysis¶

Record Structure Characteristics¶

Property	Value	Notes
File size	257 KB (1k records)	ISO 2709 binary format
Avg per record	264 bytes	Includes record leader + field markers
JSON size	585 bytes	2.2x expansion for JSON serialization
Fields per record	~20	Typical bibliographic record

Allocation Hotspots (per 10k records)¶

Allocation	Count	Total	Avg	% of Heap
Field Vec allocations	10,000	6.4 MB	640 B	10%
Subfield data Strings	500,000	25.0 MB	50 B	66%
Tag Strings	200,000	600 KB	3 B	2%
Indicator Strings	200,000	400 KB	2 B	1%
Other overhead	—	~7 MB	—	21%
TOTAL HEAP	910,000	39.4 MB	—	100%

Per-Record Memory Breakdown¶

Vec allocations:      ~1,260 bytes (3%)
String headers:       ~6,610 bytes (16%)
  - Tags (20×24):       480 bytes
  - Indicators (20×24): 480 bytes
  - Subfield codes (50×24): 1,200 bytes
  - Subfield data (50×24): 1,200 bytes
  - Content data:      ~2,650 bytes
Other overhead:       ~700 bytes
─────────────────────────────────
Heap per record:      ~10,170 bytes

Actual heap per record: ~10 KB (mostly subfield data + string headers)

Memory Inefficiencies Identified¶

1. String Header Overhead (High Impact)¶

Problem: Every string uses 24-byte header, including fixed-size data

Current approach:
  Tag "245"        → String { ptr, len, cap } + "245" = 24 + 3 = 27 bytes
  Indicator "10"   → String { ptr, len, cap } + "10"  = 24 + 2 = 26 bytes

Optimal approach:
  Tag "245"        → u16 (tag number) = 2 bytes
  Indicator "10"   → [u8; 2] = 2 bytes

Savings: 24 bytes per string × 400K strings = 9.6 MB (24% of heap)

2. Vec Capacity Overhead (Medium Impact)¶

Problem: Vec grows by 1.5x, leaving 33% wasted capacity

Current: Vec<Field> with 20 fields
  Allocated: 30 items × 32 bytes = 960 bytes
  Wasted:    10 unused items = 320 bytes (33%)

Using SmallVec<[Field; 20]>:
  Stack:     20 items × 32 bytes = 640 bytes (no wasted capacity)
  Heap:      None (typical cases fit in stack)

Savings: 320 bytes × 10K records = 3.2 MB (8% of heap)

3. Tag/Indicator Allocation Proliferation (Low Impact)¶

Problem: 400K allocations for fixed data (tags, indicators)

Current:
  200K tag String allocations
  200K indicator String allocations
  Total: 400K allocations

Optimal:
  Use small arrays or interned values
  Total: ~1 allocation

Savings: Allocation overhead + 600 KB data = ~1 MB (3% of heap)

4. String Capacity Overhead¶

Problem: Strings allocate 125% capacity, wasting 20% on average

Tag "245" (3 bytes):    Allocated 4 bytes, 1 wasted (25%)
Subfield code (1 byte): Allocated 2 bytes, 1 wasted (50%)
Subfield data (50 bytes): Allocated 63 bytes, 13 wasted (20%)

Total waste: ~2.4 MB (6% of heap)

Optimization Opportunities Identified¶

Note: Specific optimization proposals and implementation plans have been separated from profiling analysis. See docs/design/OPTIMIZATION_PROPOSAL.md for: - Detailed optimization strategies - Implementation roadmap - Risk and effort estimates - Success criteria

This profiling document focuses on what limits performance in the current implementation. Optimization decisions should be made based on these findings plus strategic considerations.

Expected Performance Impact¶

Memory Efficiency (per 10k records)¶

Optimization	Savings	Cumulative	Impl. Effort
Baseline	39.4 MB	—	—
SmallVec	-3.2 MB	36.2 MB	Low
Compact tags	-9.6 MB	26.6 MB	Medium
String capacity	-2.4 MB	24.2 MB	Medium
Interning	-1.0 MB	23.2 MB	Medium
Potential Total	-16.2 MB	23.2 MB	Medium overall
Reduction %	41%	—	—

Single-Threaded Performance Impact¶

Expected impact on 1M rec/s baseline: - SmallVec: +2-3% (better cache locality) - Compact tags: +1-2% (fewer allocations) - String optimization: +1-2% (less GC pressure) - Total expected: +4-7% improvement (~1.04-1.07M rec/s)

Multi-Threaded Performance Impact¶

More significant for concurrent workloads: - Less allocation contention between threads - Better cache utilization (less allocation thrashing) - Possible +10-15% improvement on rayon implementations

Conclusions¶

This within-mode profiling has identified the primary bottleneck for pure Rust single-threaded implementation: memory allocation overhead (73% of heap is metadata).

For optimization recommendations based on these findings, see docs/design/OPTIMIZATION_PROPOSAL.md.

Tools Used¶

Criterion.rs - Statistical benchmarking (Phase 1)
Custom harness - Detailed timing analysis (Phase 2)
Memory analysis script - Allocation pattern estimation (Phase 2)

References¶

Phase 1 Results
Profiling Plan
Issues: mrrc-u33.1, mrrc-u33.2, mrrc-u33.3, mrrc-u33.4, mrrc-u33.5