MRRC Architecture¶
This document describes the architecture of the MRRC library and key design decisions.
Overview¶
MRRC is a Rust library for reading, writing, and manipulating MARC bibliographic records with Python bindings via PyO3. The library is organized into three main components:
- Core Rust Library - Pure Rust MARC record parsing and manipulation
- Python Wrapper - PyO3 bindings providing Python access with GIL release for concurrency
- Benchmarking Infrastructure - Comprehensive performance testing and profiling
Core Architecture¶
Record Types¶
MRRC supports three MARC record types:
- Bibliographic Records - Standard library catalog records (type 'a', 'c', 'm', etc.)
- Authority Records - Subject headings and name authority data (type 'z')
- Holdings Records - Physical item location and enumeration data (types 'x', 'y', 'v', 'u')
All record types share common infrastructure through the MarcRecord trait.
Data Structure¶
A MARC record consists of:
- Leader - 24-byte header with metadata
- Control Fields (000-009) - Fixed-length fields
- Data Fields (010+) - Variable-length fields with indicators and subfields
Parser Architecture¶
The core parser uses a state machine approach:
ISO 2709 Binary Format
↓
Record Boundary Scanner (finds 0x1D terminators)
↓
Leader Parser (24 bytes)
↓
Directory Parser (field offsets and lengths)
↓
Field Parser (control fields vs data fields)
↓
Subfield Parser (for data fields with indicators)
↓
Character Decoder (MARC-8 or UTF-8)
↓
Record Object
Python Wrapper Architecture¶
For a higher-level overview of how the Rust and Python code relate, how maturin and PyO3 fit in, and what builds what, see Project Layout.
GIL Release Strategy: Three-Phase Model¶
The Python wrapper implements a three-phase pattern for GIL management during every read_record() call:
Phase 1: Read bytes (GIL held)
↓
Phase 2: Parse bytes (GIL RELEASED) ← Concurrent work happens here
↓
Phase 3: Convert to Python object (GIL re-acquired)
Phase 1 (GIL held):
- Acquire raw record bytes from source
- Python file object: via Python read() method
- File path: via Rust std::fs::File (no GIL overhead)
- Bytes: already in memory (no I/O)
- Duration: Very short (I/O cached in kernel)
Phase 2 (GIL released):
- Parse record bytes to MARC structure (CPU-intensive work)
- Uses py.detach() (PyO3 0.23+) to explicitly release GIL
- Creates Rust ParseError without Python objects
- SmallVec buffer handles most records inline
- Duration: ~90% of total parse time
- Result: Multiple threads can parse concurrently
Phase 3 (GIL re-acquired):
- Convert Rust ParseError to Python exception (if needed)
- Convert Rust Record to Python PyRecord
- Return to caller
- Duration: Negligible (quick object construction)
Why GIL Release Matters¶
The Python GIL (Global Interpreter Lock) serializes all Python bytecode execution. Without GIL release during parsing:
Without GIL Release (current state of pure pymarc):
Thread 1: Read bytes (GIL) → Parse (GIL) → Convert (GIL)
Thread 2: Waiting... → Waiting... → Waiting...
Result: Threading provides no speedup (1.0x)
With GIL Release (pymrrc):
Thread 1: Read (GIL) → Parse (GIL RELEASED) → Convert
Thread 2: Read (GIL) → Parse (GIL RELEASED) → Convert
Result: Threads parse in parallel (3.74x on 4 cores)
The key insight: parsing is CPU-intensive but doesn't need Python objects, so releasing the GIL enables true parallelism.
Single-threaded benefit: Even without multiple threads, Rust parsing is simply faster (~4x vs pymarc).
Multi-threaded benefit: With explicit ThreadPoolExecutor, the GIL release enables concurrent parsing across threads (additional 3.74x speedup on 4 cores).
ReaderBackend Enum¶
The unified reader supports multiple input types via a backend enum:
enum ReaderBackend {
RustFile(std::fs::File), // Pure Rust I/O, zero GIL
Cursor(io::Cursor<Vec<u8>>), // In-memory, zero GIL
PythonFile(PyObject), // Python file object, GIL managed
}
Advantages:
- Automatic Detection: Input type determined at construction
- Optimal Performance: Each backend uses fastest available method
- Backward Compatible: Python file objects still work via GIL management
- Zero-GIL Paths: File paths and bytes bypass Python entirely
Performance Impact:
- File path: Pure Rust I/O, Phase 1 has minimal GIL hold
- Bytes: Zero I/O, Phase 1 is trivial
- File object: Requires GIL for
.read(), but Phase 2 still releases it
Batched Reader (Optimization)¶
For Python file objects, batching reduces GIL contention:
Without batching (N records):
FOR i = 1 to N:
Acquire GIL → Read 1 record → Release GIL → Parse
With batching (N records, batch size = 100):
FOR batch = 1 to N/100:
Acquire GIL → Read 100 records → Release GIL
FOR record in batch:
Parse record (GIL released)
Result: N/100 GIL acquisitions instead of N.
SmallVec Optimization¶
MARC records vary in size (typically 500-4000 bytes). The SmallVec buffer:
Benefits:
- Inline storage for ~85-90% of records (no allocation)
- Dynamic heap allocation for oversized records
- <3% memory overhead
- Eliminates borrow checker issues in Phase 2
Error Handling¶
ParseError Enum¶
Custom error type allows error creation without GIL:
pub enum ParseError {
InvalidRecord(String),
InvalidLeader(String),
InvalidDirectory(String),
EncodingError(String),
}
impl From<ParseError> for PyErr {
// Conversion happens after GIL re-acquisition in Phase 3
}
Why Custom Error Type? - PyErr requires GIL to create - ParseError can be created during Phase 2 (GIL released) - Defers PyErr conversion to Phase 3 (after GIL re-acquired)
Thread Safety¶
Not Send/Sync by Design¶
The readers are intentionally not Send or Sync:
// Readers hold Python references (not Send/Sync)
pub struct PyMARCReader {
reader: Option<ReaderType>,
// ReaderType may contain PythonFile(PyObject) which is !Send
}
Why? - Each thread needs its own GIL-aware reader - Sharing readers across threads causes undefined behavior - Forces correct usage pattern: one reader per thread
Concurrency Model¶
Two APIs for Different Use Cases¶
Standard MARCReader (Sequential - No Multi-Threading Benefit)¶
from mrrc import MARCReader
# Simple sequential reading
reader = MARCReader("records.mrc")
for record in reader:
process(record)
Performance: - ✅ Single-threaded: ~4x faster than pymarc - ❌ Multi-threaded: 0.85x slowdown (GIL contention) - Use when: Sequential processing or single-file reads
ProducerConsumerPipeline (High-Performance Single-File Multi-Threading)¶
from mrrc import ProducerConsumerPipeline
# Background producer thread reads file and parses with Rayon
pipeline = ProducerConsumerPipeline.from_file('large_file.mrc')
for record in pipeline:
process(record)
Verified Performance: - 2 threads: 2.0x speedup - 4 threads: 3.74x speedup - Scales with CPU core count
How it works: - Background producer thread reads file in 512 KB chunks - Bounded channel provides backpressure (1000 records) - Rayon parses batches in parallel on all CPU cores - Producer runs without GIL, eliminating contention
Use when: Processing a single large MARC file with maximum throughput from available cores
Performance Characteristics¶
Throughput (Records/Second)¶
| Mode | Throughput | Notes |
|---|---|---|
| Sequential (1 thread) | 549,500 rec/s | Baseline |
| Parallel (2 threads) | ~1.1M rec/s | ~2.0x speedup |
| Parallel (4 threads) | ~2.0M rec/s | ~3.74x speedup |
Memory Usage¶
| Scenario | Memory | Notes |
|---|---|---|
| Per reader | ~4 KB | SmallVec buffer |
| Per record (memory) | ~4 KB | Typical MARC record |
| Overhead (4 readers) | ~16 KB | Negligible |
GIL Contention¶
| Phase | GIL Status | Duration | Notes |
|---|---|---|---|
| Phase 1 | Held | Short | Read bytes only |
| Phase 2 | Released | Long | Parsing (CPU-bound) |
| Phase 3 | Held | Short | Convert to Python |
Character Encoding¶
MARC-8 Support¶
MARC-8 is a legacy encoding with: - Basic Latin (ASCII) - ANSEL Extended Latin with diacritical marks - Greek, Cyrillic, Arabic, Hebrew scripts - East Asian support (Chinese, Japanese, Korean) - Combining characters with Unicode NFC normalization
UTF-8 Support¶
Modern MARC records use UTF-8 (detected from leader position 9).
Automatic Detection¶
Character set detected from MARC leader: - Position 9: ' ' = MARC-8, 'a' = UTF-8 - Decoder selected automatically - Invalid bytes produce errors with context
Format Conversions¶
Supported Formats¶
- JSON: Generic field-based representation
- MARCJSON: Standard JSON-LD format (LOC spec)
- XML: Field/subfield XML structure
- CSV: Tabular export for spreadsheets
- Dublin Core: Simplified 15-element metadata
- MODS: Metadata Object Description Schema
- BIBFRAME: RDF/Linked Data (bidirectional conversion)
Conversion Approach¶
Each format has: 1. Serializer: Record → Format bytes 2. Deserializer: Format bytes → Record 3. Round-trip tests: Ensure lossless conversion
Testing¶
Test Categories¶
- Unit Tests: Individual components (parsers, builders, queries)
- Integration Tests: End-to-end workflows (read → process → write)
- Compatibility Tests: pymarc compatibility validation (75+ tests)
- Performance Tests: Benchmarking with Criterion.rs and pytest-benchmark
- Encoding Tests: MARC-8, UTF-8, multilingual records
Test Fixtures¶
Located in tests/data/:
- simple_book.mrc - Basic bibliographic record
- multi_records.mrc - Multiple records in one file
- simple_authority.mrc - Authority record
- simple_holdings.mrc - Holdings record
- with_control_fields.mrc - Record with 008 field
Benchmark Fixtures¶
Located in tests/data/fixtures/:
- 1k_records.mrc (257 KB) - Quick benchmarks
- 10k_records.mrc (2.5 MB) - Standard benchmarks
Key Design Principles¶
- Rust-Idiomatic: Uses iterators, Result types, ownership patterns
- Zero-Copy Where Possible: Efficient memory usage for large workloads
- Format Flexibility: Multiple serialization formats out of box
- Compatibility: Maintains data fidelity with pymarc
- Performance: Concurrent I/O with intelligent GIL management
- Safety: GIL release without unsafe code (except PyO3 glue)
References¶
Performance & Benchmarking:
- Performance Tuning Guide - Usage patterns and tuning
- Benchmarking Results - Detailed performance data
- Performance FAQ - Quick Q&A about speedups
Guides:
- Threading in Python - Thread safety and GIL behavior
External References: