Benchmarking FAQ¶

Performance Questions¶

Is mrrc faster than pymarc?¶

Yes — substantially. On a realistic corpus the Python wrapper reads much faster than pymarc per record, and faster still through the parallel batch path; the native Rust crate is faster again. The speedup is structural: each record is parsed by compiled Rust instead of interpreted Python. See Results for the measured three-way figures and how to reproduce them on your own workload.

Does mrrc automatically use multiple cores?¶

No. By default, mrrc runs on one thread (like pymarc), but each record parses faster because parsing happens in Rust.

To use multiple cores:

Use ProducerConsumerPipeline for single-file processing
Use ThreadPoolExecutor to process multiple files in parallel

Do I need to change my code to get the single-threaded speedup?¶

No. If upgrading from pymarc, install mrrc and existing code benefits automatically.

Do I need to change my code for multi-threading?¶

Yes. For single-file multi-threaded processing, use ProducerConsumerPipeline. For multi-file processing, use ThreadPoolExecutor. See the threading guide for examples.

Is GIL release automatic?¶

Yes. Every call to read_record() automatically releases the GIL during parsing. No special handling required.

Understanding the Speedups¶

Single-Threaded¶

from mrrc import MARCReader

# Parsing runs in Rust; no per-record Python interpretation.
# Pass a path (not a file object) so I/O also runs in Rust, GIL-free.
reader = MARCReader("records.mrc")
while record := reader.read_record():
    title = record.title
    # ... process record ...

The Python wrapper adds overhead relative to pure Rust (each record crosses the Rust/Python boundary), but record parsing itself runs at Rust speed. File objects also work, but the GIL is held for their .read() calls — prefer paths.

Multi-Threaded¶

from concurrent.futures import ThreadPoolExecutor
from mrrc import MARCReader

def process_file(path):
    # Each thread gets its own reader; path input keeps I/O in Rust,
    # so threads never contend on the GIL for reads.
    reader = MARCReader(path)
    while record := reader.read_record():
        ...  # process record

# Process 4 files in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process_file, ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"])

Because the GIL is released during parsing, threads parse concurrently and speedup scales with core count. Efficiency is sub-linear: thread management, memory bandwidth, and the Python code between records all take their share. Measure with your own files to find the practical ceiling on your hardware.

Use multi-threading when processing multiple independent files. For single files, use ProducerConsumerPipeline.

Upgrade Considerations¶

When mrrc provides clear benefit:¶

Processing large files (100k+ records)
MARC processing is a bottleneck in your workflow
You need multi-threading for parallel file processing

When mrrc may not be necessary:¶

Pure Python environment required (no C extensions)
Deeply integrated custom code with pymarc internals
Processing very small files (< 1k records)

Benchmark Selection¶

Which benchmark matches my use case?¶

Raw reading (test_benchmark_reading.py): best case for mrrc
Reading + field extraction (test_benchmark_reading.py): typical use
Round-trip read + write (test_benchmark_writing.py): format conversion
Parallel processing (test_benchmark_parallel.py, test_benchmark_pipeline_parallel.py): processing many files or one large file with threads (local runs only — see Results)

What does "rec/s" mean?¶

Records per second. Higher is better. Throughput depends on hardware, record size, and what your code does with each record — measure on your own data.

GIL and Threading¶

What is the GIL?¶

Global Interpreter Lock. Python's mechanism to make the interpreter thread-safe. Only one thread can execute Python bytecode at a time. When one thread releases the GIL, other threads can run.

Does mrrc release the GIL?¶

Yes. During record parsing, the GIL is released via py.detach(). This happens automatically with every read_record() call.

Do I need to think about the GIL?¶

No. In single-threaded mode, it's automatic and transparent. In multi-threaded mode, the GIL release enables concurrent parsing.

Undefined behavior. The reader is not thread-safe. Create a new reader per thread.

# Wrong - do not share readers across threads
reader = MARCReader("file.mrc")
ThreadPoolExecutor().map(lambda _: reader.read_record(), range(4))

# Correct - each thread gets its own reader
def read_all(path):
    reader = MARCReader(path)
    while record := reader.read_record():
        # ... process ...

ThreadPoolExecutor().map(read_all, ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"])

Performance Tuning¶

How can I make mrrc faster?¶

File path input: Pass file paths directly instead of file objects
Multi-threading: Use ThreadPoolExecutor for multiple files, ProducerConsumerPipeline for single files
Batch processing: Read multiple records at once
Use Rust: Rewrite critical path in Rust if possible

Does record size affect performance?¶

Minimally. Throughput stays consistent regardless of record size.

Does Python version matter?¶

Only for availability. mrrc supports Python 3.10+. Performance is similar across versions.

Does OS matter?¶

Only for binary compatibility. mrrc provides wheels for Linux, macOS, and Windows. Performance is similar across platforms.