Skip to content

Benchmarking FAQ

Performance Questions

Is pymrrc faster than pymarc?

Early benchmarking suggested at least a 4x single-threaded speedup over pymarc; these benchmarks need to be updated and reconsidered. The mechanism behind the speedup is structural: each record is parsed by compiled Rust code instead of interpreted Python. To measure the difference on your own workload, see Results.

Does pymrrc automatically use multiple cores?

No. By default, pymrrc runs on one thread (like pymarc), but each record parses faster because parsing happens in Rust.

To use multiple cores:

  • Use ProducerConsumerPipeline for single-file processing
  • Use ThreadPoolExecutor to process multiple files in parallel

Do I need to change my code to get the single-threaded speedup?

No. If upgrading from pymarc, install pymrrc and existing code benefits automatically.

Do I need to change my code for multi-threading?

Yes. For single-file multi-threaded processing, use ProducerConsumerPipeline. For multi-file processing, use ThreadPoolExecutor. See the threading guide for examples.

Is GIL release automatic?

Yes. Every call to read_record() automatically releases the GIL during parsing. No special handling required.


Understanding the Speedups

Single-Threaded

from mrrc import MARCReader

# Parsing runs in Rust; no per-record Python interpretation.
# Pass a path (not a file object) so I/O also runs in Rust, GIL-free.
reader = MARCReader("records.mrc")
while record := reader.read_record():
    title = record.title
    # ... process record ...

The Python wrapper adds overhead relative to pure Rust (each record crosses the Rust/Python boundary), but record parsing itself runs at Rust speed. File objects also work, but the GIL is held for their .read() calls — prefer paths.

Multi-Threaded

from concurrent.futures import ThreadPoolExecutor
from mrrc import MARCReader

def process_file(path):
    # Each thread gets its own reader; path input keeps I/O in Rust,
    # so threads never contend on the GIL for reads.
    reader = MARCReader(path)
    while record := reader.read_record():
        ...  # process record

# Process 4 files in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process_file, ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"])

Because the GIL is released during parsing, threads parse concurrently and speedup scales with core count. Efficiency is sub-linear: thread management, memory bandwidth, and the Python code between records all take their share. Measure with your own files to find the practical ceiling on your hardware.

Use multi-threading when processing multiple independent files. For single files, use ProducerConsumerPipeline.


Upgrade Considerations

When pymrrc provides clear benefit:

  • Processing large files (100k+ records)
  • MARC processing is a bottleneck in your workflow
  • You need multi-threading for parallel file processing

When pymrrc may not be necessary:

  • Pure Python environment required (no C extensions)
  • Deeply integrated custom code with pymarc internals
  • Processing very small files (< 1k records)

Benchmark Selection

Which benchmark matches my use case?

  • Raw reading (test_benchmark_reading.py): best case for mrrc
  • Reading + field extraction (test_benchmark_reading.py): typical use
  • Round-trip read + write (test_benchmark_writing.py): format conversion
  • Parallel processing (test_benchmark_parallel.py, test_benchmark_pipeline_parallel.py): processing many files or one large file with threads (local runs only — see Results)

What does "rec/s" mean?

Records per second. Higher is better. Throughput depends on hardware, record size, and what your code does with each record — measure on your own data.


GIL and Threading

What is the GIL?

Global Interpreter Lock. Python's mechanism to make the interpreter thread-safe. Only one thread can execute Python bytecode at a time. When one thread releases the GIL, other threads can run.

Does pymrrc release the GIL?

Yes. During record parsing, the GIL is released via py.detach(). This happens automatically with every read_record() call.

Do I need to think about the GIL?

No. In single-threaded mode, it's automatic and transparent. In multi-threaded mode, the GIL release enables concurrent parsing.

What happens if I share a reader across threads?

Undefined behavior. The reader is not thread-safe. Create a new reader per thread.

# Wrong - do not share readers across threads
reader = MARCReader("file.mrc")
ThreadPoolExecutor().map(lambda _: reader.read_record(), range(4))

# Correct - each thread gets its own reader
def read_all(path):
    reader = MARCReader(path)
    while record := reader.read_record():
        # ... process ...

ThreadPoolExecutor().map(read_all, ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"])

Performance Tuning

How can I make pymrrc faster?

  1. File path input: Pass file paths directly instead of file objects
  2. Multi-threading: Use ThreadPoolExecutor for multiple files, ProducerConsumerPipeline for single files
  3. Batch processing: Read multiple records at once
  4. Use Rust: Rewrite critical path in Rust if possible

Does record size affect performance?

Minimally. Throughput stays consistent regardless of record size.

Does Python version matter?

Only for availability. pymrrc supports Python 3.10+. Performance is similar across versions.

Does OS matter?

Only for binary compatibility. pymrrc provides wheels for Linux, macOS, and Windows. Performance is similar across platforms.


Further Reading