Concurrency (Python)

Learn to process MARC records in parallel using Python.

Quick Reference

What You're Doing	Approach	Typical Speedup
Reading a single file	File path + sequential	1x (baseline; GIL released during parsing)
Processing multiple files	ThreadPoolExecutor	2-3x
Processing one large file	ProducerConsumerPipeline	3-4x

Why Concurrency?

MRRC releases Python's GIL during record parsing, enabling true parallel processing:

2 threads: ~2x speedup
4 threads: ~3x speedup
Ideal for processing multiple files or large datasets
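To check figures like these on your own hardware, a small timing harness can compare a sequential loop against a thread pool. The sketch below uses only the standard library. `measure_speedup` is a hypothetical helper name, not part of MRRC, and `sleepy` is a stand-in for work that releases the GIL (`time.sleep` does). Pure-Python CPU-bound work would show little or no speedup under the same harness.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_speedup(work, items, max_workers):
    """Return sequential_time / threaded_time for work() applied to items.

    Hypothetical helper for illustration; not part of MRRC.
    """
    start = time.perf_counter()
    sequential = [work(item) for item in items]
    sequential_time = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        threaded = list(pool.map(work, items))  # map() preserves input order
    threaded_time = time.perf_counter() - start

    if sequential != threaded:
        raise ValueError("sequential and threaded runs disagree")
    return sequential_time / threaded_time


def sleepy(item):
    """Stand-in for GIL-releasing work such as record parsing."""
    time.sleep(0.05)
    return item
```

In practice you would pass a function that reads one MARC file in place of `sleepy`, and a list of file paths in place of the test items.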

Reading a Single File

For single-file processing, pass file paths directly. This uses pure Rust I/O and releases the GIL during parsing, making your code "concurrency-ready" even in sequential use:

from mrrc import MARCReader

# Recommended: file path (Rust I/O, GIL released during parsing)
for record in MARCReader("records.mrc"):
    print(record.title())

Avoid file objects when possible, because they hold the GIL during Python I/O:

# Slower: file object (GIL held for Python I/O)
with open("records.mrc", "rb") as f:
    for record in MARCReader(f):
        print(record.title())

Use file objects only when needed (e.g., network streams, custom I/O).

Processing Multiple Files

When you have many files to process, use ThreadPoolExecutor to read them in parallel:

from concurrent.futures import ThreadPoolExecutor
from mrrc import MARCReader

def process_file(path):
    """Process a single MARC file."""
    count = 0
    for record in MARCReader(path):
        if record.title():
            count += 1
    return count

# Process files in parallel (one thread per file)
files = ["file1.mrc", "file2.mrc", "file3.mrc", "file4.mrc"]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_file, files))

print(f"Total records: {sum(results)}")

Each thread gets its own reader, and the GIL is released during parsing, so threads run truly in parallel.
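The example above hard-codes `max_workers=4`. One way to choose the value is to cap it at both the number of files and the number of CPUs, as sketched below. This assumes the work is parsing-bound, so that threads beyond the core count add little. `pick_worker_count` is a hypothetical helper, not an MRRC function.

```python
import os


def pick_worker_count(n_files, limit=None):
    """Return a thread count no larger than the file count or the CPU count.

    Hypothetical helper; assumes parsing-bound work where extra threads
    beyond the number of cores are unlikely to help.
    """
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    workers = min(n_files, cpus)
    if limit is not None:
        workers = min(workers, limit)
    return max(1, workers)  # ThreadPoolExecutor requires at least 1 worker
```

The result can be passed straight to `ThreadPoolExecutor(max_workers=...)`.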

Processing a Large File

When you have a single large file, use ProducerConsumerPipeline to parallelize parsing:

from mrrc import ProducerConsumerPipeline

# Create pipeline (auto-scales to CPU cores)
pipeline = ProducerConsumerPipeline.from_file("large_file.mrc")

# Process records (they arrive in original order)
for record in pipeline:
    print(record.title())

The pipeline achieves a ~3.7x speedup on 4 cores by splitting the work into three stages:

Producer thread: Reads record bytes from disk
Parser threads: Parse bytes into records in parallel (GIL released)
Consumer: Receives parsed records in original order
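The three stages above can be sketched with the standard library alone. This is not MRRC's implementation, only an illustration of ordered fan-out under two assumptions: the input is binary MARC (ISO 2709), where the first five bytes of each record give its total length as ASCII digits, and `parse` is a caller-supplied stand-in for the real parser. The names `iter_raw_records` and `ordered_pipeline` are hypothetical. For simplicity the reading step runs in the calling thread rather than a dedicated producer thread.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def iter_raw_records(stream):
    """Yield each record's raw bytes, using the 5-digit length prefix."""
    while True:
        prefix = stream.read(5)
        if not prefix:
            return  # clean end of input
        if len(prefix) < 5 or not prefix.isdigit():
            raise ValueError("malformed record length prefix")
        remaining = int(prefix) - 5
        body = stream.read(remaining)
        if len(body) != remaining:
            raise ValueError("truncated record")
        yield prefix + body


def ordered_pipeline(stream, parse, workers=4, window=64):
    """Parse records on a thread pool and yield results in input order.

    At most `window` records are in flight, which bounds memory use.
    """
    pending = deque()  # futures, oldest first
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for raw in iter_raw_records(stream):
            pending.append(pool.submit(parse, raw))
            if len(pending) >= window:
                # Block on the oldest future so output order matches input.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()
```

A pure-Python `parse` will not run faster here, because it holds the GIL. The pattern pays off only when the parse step releases the GIL, which the source says MRRC's parser does.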

Thread Safety

Safe patterns:

Create one reader per thread
Use file paths for maximum parallelism
Use ThreadPoolExecutor for multi-file processing
Use ProducerConsumerPipeline for single large files

Unsafe patterns:

Sharing a MARCReader across threads
Passing file objects between threads
Modifying the same Record from multiple threads
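One way to stay within the safe patterns is to have each worker build and return its own result, and to merge results only in the calling thread. The sketch below illustrates that shape with the standard library. `tally` is a stand-in for per-file work and `tally_all` is a hypothetical name; neither is part of MRRC.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tally(words):
    """Stand-in for per-file work: builds and returns a private Counter."""
    return Counter(words)


def tally_all(batches, max_workers=4):
    """Run tally() on each batch in a thread pool and merge the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(tally, batches):
            # Only the calling thread touches `total`, so no lock is needed.
            total.update(partial)
    return total
```

Because no object is shared between workers, there is nothing to lock. If a shared object genuinely must be updated from several threads, guarding it with `threading.Lock` is the usual alternative.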

Complete Example

#!/usr/bin/env python3
"""Process MARC files in parallel."""

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from mrrc import MARCReader

def extract_titles(path):
    """Extract all titles from a MARC file."""
    titles = []
    for record in MARCReader(path):
        if title := record.title():
            titles.append(title)
    return path.name, titles

def main():
    # Find all .mrc files
    marc_files = list(Path("data").glob("*.mrc"))
    print(f"Found {len(marc_files)} MARC files")

    all_titles = {}

    # Process in parallel
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(extract_titles, f): f for f in marc_files}

        for future in as_completed(futures):
            filename, titles = future.result()
            all_titles[filename] = titles
            print(f"{filename}: {len(titles)} titles")

    total = sum(len(t) for t in all_titles.values())
    print(f"Total: {total} titles from {len(marc_files)} files")

if __name__ == "__main__":
    main()
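In the script above, `future.result()` re-raises any exception raised inside the worker, so one unreadable file stops the whole run. A variant that records failures per file and keeps going might look like the sketch below. `run_all` is a hypothetical helper, and `worker` stands in for a function such as `extract_titles`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(worker, paths, max_workers=4):
    """Run worker(path) for each path; return (results, errors) dicts.

    Hypothetical helper: both dicts are keyed by the original path.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; report the failure later
                errors[path] = exc
    return results, errors
```

After the pool finishes, the caller can print or log `errors` and still use everything in `results`.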

Performance Comparison

Typical speedups on a 4-core system:

Approach	Speedup	Best For
Sequential (file path)	1x	Simple scripts, small files
ThreadPoolExecutor (2 threads)	2.0x	A few files
ThreadPoolExecutor (4 threads)	3.2x	Many files
ProducerConsumerPipeline	3.7x	One large file
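A speedup figure is easier to compare across rows when expressed as parallel efficiency, that is, speedup divided by the number of workers. The short calculation below applies that to the table's figures. It assumes the pipeline row uses all 4 cores, and `parallel_efficiency` is a hypothetical helper name.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup / workers


# Figures copied from the table above (4-core system).
measurements = [
    ("ThreadPoolExecutor (2 threads)", 2.0, 2),
    ("ThreadPoolExecutor (4 threads)", 3.2, 4),
    ("ProducerConsumerPipeline", 3.7, 4),  # assumes all 4 cores are in use
]

for name, speedup, workers in measurements:
    print(f"{name}: {parallel_efficiency(speedup, workers):.1%} efficiency")
```

On these numbers, two threads scale almost linearly, while four threads over many files lose roughly a fifth of the ideal throughput.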

Next Steps

Reading Records - Basic record access
Threading Guide - GIL behavior and advanced patterns
Performance Tuning - Optimization tips