# Python Wrapper Implementation Strategies
Date: 2025-12-28
Status: DRAFT - Strategy Documents
Task: mrrc-9ic.9
This document codifies pre-implementation strategies for the Python wrapper to ensure Phase 1 can proceed smoothly without discovering critical gaps during implementation.
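## 1. Type Hint & IDE Support Strategy
### 1.1 Overview
Python users expect IDE autocomplete and type-checker support. PyO3 does not generate .pyi stub files by default, so the stubs have to be written by hand (or produced by a separate tool), shipped inside the package, and kept in sync with the Rust API. We must ensure they're discoverable and correct.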
### 1.2 Approach
#### Step 1: Configure Maturin to Package the Module and Stubs
```toml
# pyproject.toml
[tool.maturin]
python-packages = ["mrrc"]
module-name = "mrrc._mrrc"  # Native module name
```
#### Step 2: Expose Python Types via `__init__.py`
```python
# src-python/mrrc/__init__.py
from mrrc._mrrc import (
    Record,
    Field,
    Leader,
    MARCReader,
    MARCWriter,
    MarcException,
)

__all__ = [
    "Record",
    "Field",
    "Leader",
    "MARCReader",
    "MARCWriter",
    "MarcException",
]
```
#### Step 3: Add py.typed Marker (PEP 561)
Create an empty file `src-python/mrrc/py.typed` (no content needed).
- Signals to type checkers that this package has type information
- Must be included in the wheel distribution
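Because the stub has to describe the native module, a hand-written `_mrrc.pyi` might start like the sketch below. The class and method names come from the examples in this document; the exact signatures, including the `BinaryIO` constructor argument, are assumptions until the Rust API is final.

```python
# Sketch of src-python/mrrc/_mrrc.pyi (stub files use ordinary Python syntax).
# All signatures are assumptions inferred from the examples in this document.
from typing import BinaryIO, Optional


class Record:
    def title(self) -> Optional[str]: ...


class MARCReader:
    def __init__(self, file: BinaryIO) -> None: ...
    def read_record(self) -> Optional[Record]: ...


class MARCWriter:
    def __init__(self, file: BinaryIO) -> None: ...
    def write_record(self, record: Record) -> None: ...
```

Keeping the stub next to `__init__.py` and `py.typed` lets type checkers resolve `from mrrc import MARCReader` without any extra configuration.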
#### Step 4: Validate Type Hints
```bash
# Install type checking tools
pip install mypy pyright pytest
# Test with mypy
mypy tests/python/ --strict
# Test with pyright
pyright tests/python/
```
### 1.3 Configuration
```toml
# pyproject.toml additions
[tool.mypy]
python_version = "3.9"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
disallow_incomplete_defs = true

[tool.pyright]
pythonVersion = "3.9"
typeCheckingMode = "strict"
```
### 1.4 Docstring Format (for IDE inference)
Use Google-style docstrings and state the Python types in the docstring; the `text_signature` only describes the parameter list:
```rust
/// Read the next MARC record from the input stream.
///
/// Returns:
///     Optional[Record]: The next record, or None once EOF is reached.
///
/// Raises:
///     MarcException: If the MARC data is malformed or encoding fails.
///
/// Example:
///     >>> reader = MARCReader(open("records.mrc", "rb"))
///     >>> record = reader.read_record()
///     >>> if record:
///     ...     print(record.title())
#[pyo3(text_signature = "($self)")]
pub fn read_record(&mut self) -> PyResult<Option<PyRecord>> {
    // ...
}
```
### 1.5 Testing Type Hints
```python
# tests/python/test_types.py
from typing import Optional

from mrrc import MARCReader, Record


def test_reader_returns_record() -> None:
    """Verify type hints are correct."""
    # Close the file handle deterministically instead of leaking it.
    with open("test.mrc", "rb") as fh:
        reader: MARCReader = MARCReader(fh)
        record: Optional[Record] = reader.read_record()
    assert record is None or isinstance(record, Record)
```
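### 1.6 Build Configuration
```bash
# In GitHub Actions or CI:
# Step 1: Build wheels into dist/ (maturin packages the .pyi files found in the Python package;
#         without --out the wheels land in target/wheels)
maturin build --release --out dist
# Step 2: Unpack wheel and verify .pyi files exist
unzip dist/mrrc-*.whl -d /tmp/wheel_check
ls /tmp/wheel_check/mrrc/*.pyi
# Step 3: Install the wheel, then type check the tests against the shipped stubs
pip install dist/mrrc-*.whl
mypy tests/python/
```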
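The wheel check can also be scripted so that CI fails with a clear message on every platform, including runners without `unzip`. A minimal sketch using only the standard library; `missing_typing_files` is a hypothetical helper name, and the package directory `mrrc` is taken from the layout above.

```python
import zipfile


def missing_typing_files(wheel_path: str, package: str = "mrrc") -> list[str]:
    """Return the typing artifacts that are absent from the wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        names = wheel.namelist()
    missing = []
    # PEP 561 marker file
    if f"{package}/py.typed" not in names:
        missing.append("py.typed")
    # At least one stub file inside the package directory
    if not any(n.startswith(f"{package}/") and n.endswith(".pyi") for n in names):
        missing.append("*.pyi")
    return missing
```

A CI step could call this for each file in `dist/` and exit non-zero when the returned list is not empty.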
## 2. Python Documentation Strategy
### 2.1 Overview
rustdoc output does not serve as Python documentation; PyO3 only copies doc comments into each object's `__doc__`. We need a clear approach for turning those docstrings into Python API docs.
### 2.2 Docstring Convention
Use Google-style docstrings in Rust code (PyO3 will include them in `__doc__`):
```rust
#[pyclass(name = "Record")]
pub struct PyRecord {
    inner: mrrc::Record,
}

#[pymethods]
impl PyRecord {
    /// Create a new MARC record with the given leader.
    ///
    /// Args:
    ///     leader: A Leader object defining record type and encoding.
    ///
    /// Returns:
    ///     A new Record instance.
    ///
    /// Example:
    ///     >>> from mrrc import Record, Leader
    ///     >>> leader = Leader(...)
    ///     >>> record = Record(leader)
    #[new]
    pub fn new(leader: PyLeader) -> Self {
        PyRecord {
            inner: mrrc::Record::new(leader.inner.clone()),
        }
    }

    /// Get the title of the record.
    ///
    /// Extracts the main title from field 245 subfield 'a'.
    ///
    /// Returns:
    ///     The title string, or None if not present.
    pub fn title(&self) -> Option<String> {
        self.inner.title()
    }
}
```
### 2.3 Documentation Generation
#### Option A: Sphinx with sphinx.ext.autodoc
```bash
# Generate HTML docs from docstrings
pip install sphinx sphinx-rtd-theme
# Create docs/conf.py with "sphinx.ext.autodoc" (and "sphinx.ext.napoleon" for Google-style docstrings) in `extensions`
# Run: sphinx-build -b html docs docs/_build/
```
#### Option B: mkdocs (Simpler)
```yaml
# mkdocs.yml
site_name: mrrc Python API
theme:
  name: material
nav:
  - Home: index.md
  - API Reference:
      - Record: api/record.md
      - Field: api/field.md
      - MARCReader: api/reader.md
```
#### Option C: Manual Docs (Starting Point)
````markdown
# docs/python_api.md
## Record Class
### Record(leader)
Create a new MARC record.

**Parameters:**
- `leader` (Leader): Record leader

**Returns:** Record instance

**Example:**
```python
from mrrc import Record, Leader
leader = Leader(...)
record = Record(leader)
```
````

**Recommendation:** Start with Option C (manual), move to Option B (mkdocs) once stable.
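
Since PyO3 exposes the Rust doc comments through `__doc__`, the manual page in Option C could later be generated rather than typed by hand. The following is a minimal sketch under that assumption; `render_class_markdown` is a hypothetical helper, and the `Record` class here is a pure-Python stand-in so the example runs without the compiled module.

```python
import inspect


def render_class_markdown(cls) -> str:
    """Render a class and its public methods as a Markdown section."""
    lines = [f"## {cls.__name__} Class", "", inspect.getdoc(cls) or "", ""]
    for name, member in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue  # skip dunder and private members
        lines += [f"### {name}", "", inspect.getdoc(member) or "(undocumented)", ""]
    return "\n".join(lines)


class Record:  # stand-in for mrrc.Record so the sketch runs on its own
    """A MARC record."""

    def title(self):
        """Get the title of the record."""
        return None


print(render_class_markdown(Record))
```

The same function should work on the real classes once the wheel is installed, because it relies only on `__doc__` and `inspect`.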
### 2.4 Publishing Docs
- Build as part of CI (not PyPI release)
- Publish to GitHub Pages (`gh-pages` branch or `docs/` folder)
- Include README quick-start examples
---
## 3. Benchmarking Framework Strategy
### 3.1 Overview
To quantify performance gains over `pymarc`, we need reproducible, isolated benchmarks.
### 3.2 Test Data Generation
```python
# tests/python/conftest.py
import pytest

from mrrc import Field, Leader, MARCWriter, Record


@pytest.fixture(scope="session")
def sample_records_100k():
    """Generate 100k sample MARC records for benchmarking."""
    records = []
    for i in range(100_000):
        leader = Leader(
            record_type='a',
            bibliographic_level='m',
            ...  # remaining leader fields are elided here
        )
        record = Record(leader)
        record.add_control_field("001", f"00000{i:05d}")
        record.add_field(
            Field("245", '1', '0')
            .add_subfield('a', f"Record {i} /")
            .add_subfield('c', "Author Name.")
        )
        records.append(record)
    return records


@pytest.fixture(scope="session")
def sample_mrc_file_100k(tmp_path_factory, sample_records_100k):
    """Write 100k records to a temporary .mrc file."""
    mrc_file = tmp_path_factory.mktemp("data") / "records_100k.mrc"
    # Close the handle before returning so readers see fully flushed data.
    with open(mrc_file, "wb") as fh:
        writer = MARCWriter(fh)
        for record in sample_records_100k:
            writer.write_record(record)
    return mrc_file
```
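
If benchmark data is needed before the `mrrc` writer exists, or independently of it, the raw bytes can also be produced directly. A minimal sketch, assuming the MARC 21 binary layout (24-byte leader, 12-byte directory entries, `0x1E`/`0x1D`/`0x1F` delimiters) with UTF-8 data; `build_record` is a hypothetical helper and covers only the 001 and 245 fields used in the fixture above.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def build_record(control_number: str, title: str) -> bytes:
    """Build one minimal MARC 21 record with fields 001 and 245."""
    fields = [
        ("001", control_number.encode("utf-8") + FIELD_END),
        ("245", b"10" + SUBFIELD + b"a" + title.encode("utf-8") + FIELD_END),
    ]
    directory = b""
    data = b""
    for tag, body in fields:
        # Directory entry: tag (3), field length (4), start offset (5)
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base_address = 24 + len(directory) + 1        # leader + directory + terminator
    record_length = base_address + len(data) + 1  # plus the record terminator
    # Leader: length, status/type/level, coding, counts, base address, entry map
    leader = f"{record_length:05d}nam a22{base_address:05d}   4500".encode("ascii")
    return leader + directory + FIELD_END + data + RECORD_END
```

Writing the output of `build_record` in a loop to a file opened in `"wb"` mode yields a `.mrc` file that does not depend on the library under test, which keeps reader benchmarks separate from writer bugs.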
### 3.3 Benchmark Suite
```python
# tests/python/test_benchmark_reader.py
import pytest

from mrrc import MARCReader, MARCWriter


@pytest.mark.benchmark(group="reader")
def test_read_100k_records(benchmark, sample_mrc_file_100k):
    """Benchmark reading 100k MARC records."""
    def read_all():
        records = []
        with open(sample_mrc_file_100k, "rb") as fh:
            reader = MARCReader(fh)
            while record := reader.read_record():
                records.append(record)
        return records

    result = benchmark(read_all)
    assert len(result) == 100_000


@pytest.mark.benchmark(group="reader")
def test_read_and_extract_titles(benchmark, sample_mrc_file_100k):
    """Benchmark reading and extracting field data."""
    def read_and_extract():
        titles = []
        with open(sample_mrc_file_100k, "rb") as fh:
            reader = MARCReader(fh)
            while record := reader.read_record():
                if title := record.title():
                    titles.append(title)
        return titles

    result = benchmark(read_and_extract)
    assert len(result) > 0


@pytest.mark.benchmark(group="writer")
def test_write_100k_records(benchmark, sample_records_100k, tmp_path):
    """Benchmark writing 100k MARC records."""
    def write_all():
        output = tmp_path / "output.mrc"
        with open(output, "wb") as fh:
            writer = MARCWriter(fh)
            for record in sample_records_100k:
                writer.write_record(record)

    benchmark(write_all)
```
### 3.4 Running Benchmarks
```bash
# Run with pytest-benchmark
pytest tests/python/test_benchmark_*.py -v --benchmark-only
# Save results as JSON, then print a comparison table
pytest tests/python/ --benchmark-json=.benchmarks/results.json
pytest-benchmark compare .benchmarks/results.json
# Compare with pymarc (if available); --benchmark-compare needs a previously saved run
pip install pymarc
pytest tests/python/test_benchmark_comparison.py --benchmark-compare
```
### 3.5 Comparison Metrics
Store results for comparison:
- Throughput (records/sec)
- Memory (RSS peak MB)
- Variance (std dev %)
- Baseline (pymarc, if available)
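
The throughput and variance figures can be derived from raw per-run timings with the standard library. A minimal sketch; `summarize` and `speedup` are hypothetical helper names, and the input is assumed to be a list of wall-clock durations in seconds for runs over the same number of records.

```python
import statistics


def summarize(timings_s: list[float], n_records: int) -> dict:
    """Summarize repeated benchmark timings (seconds per run)."""
    mean = statistics.mean(timings_s)
    # stdev needs at least two samples; report 0 for a single run
    stdev = statistics.stdev(timings_s) if len(timings_s) > 1 else 0.0
    return {
        "throughput_rec_per_s": n_records / mean,
        "stddev_pct": 100.0 * stdev / mean,
    }


def speedup(candidate: dict, baseline: dict) -> float:
    """Throughput ratio of candidate over baseline (for example mrrc over pymarc)."""
    return candidate["throughput_rec_per_s"] / baseline["throughput_rec_per_s"]
```

Peak RSS is not covered here. On Unix it is available from `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`, which reports kilobytes on Linux and bytes on macOS, so the value has to be normalized before it is stored.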
## 4. CI/CD Workflow Strategy
### 4.1 Build Matrix
```yaml
# .github/workflows/python-build.yml
name: Python Build & Test
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  build-wheels:
    name: Build wheels
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - uses: PyO3/maturin-action@v1
        with:
          manylinux: auto
          # The interpreter is passed through args; --out must match the upload path below
          args: --release --out dist --interpreter ${{ matrix.python-version }}
      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}-${{ matrix.python-version }}
          path: dist
  test-wheels:
    name: Test wheels
    needs: build-wheels
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - uses: actions/download-artifact@v4
        with:
          name: wheels-${{ matrix.os }}-${{ matrix.python-version }}
          path: dist
      - shell: bash  # PowerShell on Windows does not expand dist/*.whl
        run: |
          pip install dist/*.whl
          pip install pytest pytest-benchmark mypy pyright
      - run: pytest tests/python/ -v
      - run: mypy tests/python/ --strict
```
### 4.2 Release Workflow
```yaml
# .github/workflows/python-release.yml
name: Python Release
on:
  push:
    tags:
      - 'v*'
jobs:
  build-release:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - uses: PyO3/maturin-action@v1
        with:
          manylinux: auto
          args: --release --out dist --interpreter ${{ matrix.python-version }}
      - uses: actions/upload-artifact@v4
        with:
          # upload-artifact@v4 requires a unique name per matrix job
          name: wheels-${{ matrix.os }}-${{ matrix.python-version }}
          path: dist
  publish:
    needs: build-release
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          # Collect every wheels-* artifact into a single dist/ directory
          pattern: wheels-*
          merge-multiple: true
          path: dist
      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}
```
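## 5. GIL Behavior Documentation
### 5.1 GIL Release Policy
File I/O operations release the GIL. PyO3 holds the GIL for the whole method call unless it is released explicitly, so the reader wraps the native work in `Python::allow_threads`:
```rust
// src-python/src/reader.rs
#[pyo3(text_signature = "($self)")]
pub fn read_record(&mut self, py: Python<'_>) -> PyResult<Option<PyRecord>> {
    // Release the GIL while the native Rust code runs so that other Python
    // threads can make progress; it is re-acquired when the closure returns.
    // This assumes `self.inner` does not touch Python objects while released.
    let result = py
        .allow_threads(|| self.inner.read_record())
        // Error mapping is a placeholder until mrrc-9ic.6 is decided.
        .map_err(|e| PyException::new_err(e.to_string()))?
        .map(|r| PyRecord { inner: r });
    Ok(result)
}
```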
### 5.2 Threading Examples
````markdown
# docs/threading.md
## Using mrrc with Threading
mrrc's I/O operations release the Python GIL, so reader threads can run in parallel. Separate processes avoid the GIL altogether and work the same way:
### Example: Parallel Reading with multiprocessing
```python
from multiprocessing import Pool
from mrrc import MARCReader

def process_records(filename):
    records = []
    with open(filename, "rb") as fh:
        reader = MARCReader(fh)
        while record := reader.read_record():
            records.append(record.title() or "Unknown")
    return records

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(process_records, [
            "file1.mrc",
            "file2.mrc",
            "file3.mrc",
            "file4.mrc",
        ])
```
````
### 5.3 Performance Notes
- GIL released: File read/write, MARC parsing, field access
- GIL held: Python exception creation, type conversions
- Implication: I/O-bound workloads that give each thread its own reader can approach linear speedup with threading; the benchmarks in Section 3 should confirm this
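
The thread-based counterpart of the multiprocessing pattern is one file and one reader per worker thread. The sketch below shows only the pattern: `count_records` is a pure-Python stand-in for a per-file `mrrc` read loop (it counts `0x1D` record terminators), so it runs without the compiled module and does not itself demonstrate any speedup.

```python
from concurrent.futures import ThreadPoolExecutor

RECORD_END = b"\x1d"  # MARC record terminator


def count_records(path: str) -> int:
    """Stand-in for a per-file mrrc read loop: counts record terminators."""
    with open(path, "rb") as fh:
        return fh.read().count(RECORD_END)


def count_all(paths: list[str], workers: int = 4) -> list[int]:
    """Process each file in its own thread; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_records, paths))
```

Giving each thread its own file handle and reader avoids sharing mutable reader state between threads, which is the precondition for the scaling claim above.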
## 6. Error Handling Strategy (Decision Required)
The choice between the following options is tracked in mrrc-9ic.6:
### Option A: Auto-Conversion (Simplest)
```rust
// Map every Rust error to a generic Python exception at the call site
use pyo3::exceptions::PyException;
use pyo3::prelude::*;

#[pymethods]
impl PyRecord {
    pub fn add_field(&mut self, field: PyField) -> PyResult<()> {
        self.inner
            .add_field(field.inner.clone())
            .map_err(|e| PyException::new_err(e.to_string()))
    }
}
```
### Option B: Custom Exception (Recommended)
```rust
// Define a custom Python exception hierarchy
use pyo3::create_exception;
use pyo3::exceptions::PyException;
use pyo3::prelude::*;

create_exception!(mrrc, MarcException, PyException);
create_exception!(mrrc, MarcEncodingError, MarcException);

// A free function rather than `impl From<MarcError> for PyErr`: the orphan
// rule rejects that impl when both types come from other crates.
fn to_py_err(err: mrrc::error::MarcError) -> PyErr {
    match &err {
        mrrc::error::MarcError::EncodingError(_) => {
            MarcEncodingError::new_err(err.to_string())
        }
        _ => MarcException::new_err(err.to_string()),
    }
}
```
Decision maker: mrrc-9ic.6 task
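
From the Python side, Option B would let callers branch on the hierarchy. A minimal sketch with pure-Python stand-ins for the two exception classes named above; the real classes would come from the compiled module, and `describe` is a hypothetical helper used only to show the `except` ordering.

```python
class MarcException(Exception):
    """Stand-in for the base exception exported by mrrc."""


class MarcEncodingError(MarcException):
    """Stand-in for the encoding-specific subclass."""


def describe(exc: Exception) -> str:
    """Show how callers can branch on the hierarchy."""
    try:
        raise exc
    except MarcEncodingError:
        # The most specific handler has to come first
        return "encoding problem"
    except MarcException:
        return "other MARC problem"
```

With Option A every failure surfaces as a plain `Exception`, so this kind of selective handling would have to fall back to inspecting the message text.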
## 7. Implementation Checklist (Before Phase 1)
- [ ] Decisions resolved in mrrc-9ic.6 (package name, Python version, error handling)
- [ ] Type hint strategy confirmed (mypy/pyright config written)
- [ ] Documentation approach chosen (Sphinx, mkdocs, or manual)
- [ ] Test data generation script written (conftest.py)
- [ ] Benchmark skeleton created (test_benchmark_*.py)
- [ ] CI/CD workflows drafted (.github/workflows/)
- [ ] GIL behavior documented for users
- [ ] pymarc API audit completed (mrrc-9ic.7)