Binary Format Research for MARC Data

Epic: mrrc-fks (Binary Format Evaluation)
Status: ✅ COMPLETE (9/10 evaluations done; strategy finalized)
Last Updated: 2026-01-19


Purpose

Systematic evaluation of modern binary serialization formats for MARC bibliographic data, assessing round-trip fidelity, robustness, performance, and ecosystem fit. The research produces actionable recommendations for format support in the mrrc library and its Python wrapper.

Critical Constraint: Formats that reorder fields or subfield codes are rejected, since reordering is data loss for semantics-preserving applications. All recommended formats achieve 100% round-trip preservation.
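A minimal sketch of the check behind this constraint, assuming a hypothetical record model with ordered fields and subfields (the names below are illustrative, not the actual mrrc API): a format passes only if the ordered sequence of (tag, subfield code) pairs survives an encode/decode cycle unchanged.

```python
def order_signature(record):
    """Ordered (tag, subfield code) pairs — the part lossy formats scramble."""
    return [(f.tag, sf.code) for f in record.fields for sf in f.subfields]

def preserves_order(record, encode, decode):
    """True only if field and subfield ordering survives a round trip."""
    return order_signature(decode(encode(record))) == order_signature(record)
```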


Document Guide (Start Here)

For Decision-Makers

  1. FORMAT_SUPPORT_STRATEGY.md — START HERE
     • Executive overview and format recommendations
     • Cost/benefit analysis for Tier 1, 2, and 3 formats
     • Implementation roadmap (phases, effort, timeline)
     • Final decision matrix (11 days to ship Tier 1 + 2)

  2. COMPARISON_MATRIX.md
     • Side-by-side performance metrics (read/write/round-trip speed)
     • File size and compression analysis
     • Memory efficiency benchmarks
     • Use-case fit scoring (API, batch processing, analytics, archival)
     • Customer personas and format recommendations

For Implementers

  1. EVALUATION_FRAMEWORK.md
     • Standardized evaluation methodology
     • Three-layer assessment (fidelity → robustness → performance)
     • Pass/fail criteria for each layer
     • Edge case definitions (15 MARC-specific test cases)

  2. FIDELITY_TEST_SET.md
     • 100-record test collection with edge cases
     • Records designed to stress field/subfield ordering, whitespace, and UTF-8
     • Verification procedure for round-trip fidelity

For Reference

  1. BASELINE_ISO2709.md
     • Baseline performance characteristics (903k rec/sec read)
     • ISO 2709 binary format specification reference
     • Throughput derivation methodology

  2. Individual Format Evaluations:
     • EVALUATION_PROTOBUF.md (mrrc-fks.1) — ✅ Complete
     • EVALUATION_FLATBUFFERS.md (mrrc-fks.2) — ✅ Complete
     • EVALUATION_ARROW.md (mrrc-fks.7) — ✅ Complete
     • EVALUATION_MESSAGEPACK.md (mrrc-fks.5) — ✅ Complete
     • EVALUATION_CBOR.md (mrrc-fks.6) — ✅ Complete
     • EVALUATION_AVRO.md (mrrc-fks.4) — ✅ Complete
     • EVALUATION_PARQUET.md (mrrc-fks.3) — ✅ Complete
     • EVALUATION_POLARS_ARROW_DUCKDB.md (mrrc-fks.10) — ✅ Complete

Evaluation Summary

Format Decisions

TIER 1: MUST SHIP (Non-Negotiable)
  • ✅ ISO 2709 — Baseline; 50+ year proven standard; 900k rec/sec
  • ✅ Protobuf — Modern API; schema evolution; multi-language; gRPC

TIER 2: SHIP TOGETHER (High ROI; 7 dev days total)
  • ✅ Arrow (Columnar) — 3 days | Analytics + ecosystem standard (865k rec/sec)
  • ✅ FlatBuffers — 2 days | Mobile/embedded + zero-copy (259k rec/sec; 64% memory savings)
  • ✅ MessagePack — 2 days | Compact + universal (750k rec/sec; 25% file size savings)

TIER 3: DEFER (Implement on customer demand only)
  • ⏸️ CBOR — Government/academic archival (2 days)
  • ⏸️ Avro — Kafka data lake integration (2 days)
  • ⏸️ Arrow Analytics — Discovery optimization (1 day; POC complete)

EXCLUDE: Do Not Implement
  • ❌ Parquet — Redundant with Arrow (use Arrow IPC → external export)
  • ❌ JSON/YAML/XML — Different project scope (handled in pymarc)
  • ❌ Bincode — Rust-only; no cross-platform appeal
  • ❌ Ion — Unclear MARC value; Protobuf superior

Performance Highlights

Format           Read Speed     vs ISO 2709    Memory         File Size       Fidelity
ISO 2709         903k rec/sec   Baseline       45 MB          2.6 MB          ✅ 100%
Protobuf         100k rec/sec   -88.9%         45-50 MB       7.5-8.5 MB      ✅ 100%
FlatBuffers      259k rec/sec   -71.3%         16 MB (-64%)   6.7 MB          ✅ 100%
Arrow            865k rec/sec   -4.2%          30-35 MB       1.8 MB (-30%)   ✅ 100%
MessagePack      750k rec/sec   -17.0%         40-45 MB       2.0 MB (-25%)   ✅ 100%
Arrow Analytics  1.77M rec/sec  1.96× faster   5.8 MB         1.8 MB          ✅ 100%
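For context on the units, a sketch of how rec/sec figures like these can be derived (BASELINE_ISO2709.md documents the actual methodology; read_records here is a placeholder streaming reader, not the actual mrrc API):

```python
import time

def records_per_second(path, read_records):
    """Wall-clock throughput: records decoded divided by elapsed seconds."""
    start = time.perf_counter()
    count = sum(1 for _ in read_records(path))  # drain the full record stream
    return count / (time.perf_counter() - start)
```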

Evaluation Methodology

All formats evaluated via three-layer assessment:

  1. Layer 1: Fidelity — 100% round-trip preservation (field/subfield ordering, indicators, UTF-8 content) against the 100-record test set with 15 edge cases. Pass/fail gate.

  2. Layer 2: Robustness — Graceful error handling on 7 malformed inputs (truncated records, invalid tags, oversized fields, malformed UTF-8, etc.). No panics. Pass/fail gate.

  3. Layer 3: Performance — Read/write throughput, file size, memory efficiency benchmarks. Only evaluated if Layers 1+2 pass.
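The first two layers are cheap to express as harness code. As a sketch of the Layer 2 gate (the decode function and error type are placeholders, not the actual mrrc API): every malformed input must produce a structured error, never a crash and never silent acceptance.

```python
MALFORMED = [
    b"",                        # empty input
    b"00038nam a22",            # truncated record
    b"\xff\xfe garbage bytes",  # malformed UTF-8 / random noise
]

def robustness_gate(decode, error_type=ValueError):
    """Pass only if each malformed blob raises the format's structured error."""
    for blob in MALFORMED:
        try:
            decode(blob)
            return False   # silently accepted garbage: fail
        except error_type:
            continue       # clean, typed error: exactly what the gate requires
        except Exception:
            return False   # uncontrolled failure (panic-adjacent): fail
    return True
```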


File Structure & Cross-References

docs/design/format-research/
├── README.md ← YOU ARE HERE
├── FRAMEWORK & PLANNING
│   ├── EVALUATION_FRAMEWORK.md     (mrrc-fks.8) — Evaluation methodology
│   ├── FIDELITY_TEST_SET.md        — 100-record test data + edge cases
│   ├── TEMPLATE_evaluation.md      — Blank report template
│   └── REVISIONS_SUMMARY.md        — Document evolution log
├── SYNTHESIS & STRATEGY (mrrc-fks.9)
│   ├── COMPARISON_MATRIX.md        — Aggregated results (all formats)
│   └── FORMAT_SUPPORT_STRATEGY.md  — Final recommendations + roadmap
├── BASELINES & REFERENCES
│   └── BASELINE_ISO2709.md         — ISO 2709 performance baseline
└── INDIVIDUAL EVALUATIONS (Completed)
    ├── EVALUATION_PROTOBUF.md               (mrrc-fks.1)  ✅
    ├── EVALUATION_FLATBUFFERS.md            (mrrc-fks.2)  ✅
    ├── EVALUATION_PARQUET.md                (mrrc-fks.3)  ✅
    ├── EVALUATION_AVRO.md                   (mrrc-fks.4)  ✅
    ├── EVALUATION_MESSAGEPACK.md            (mrrc-fks.5)  ✅
    ├── EVALUATION_CBOR.md                   (mrrc-fks.6)  ✅
    ├── EVALUATION_ARROW.md                  (mrrc-fks.7)  ✅
    └── EVALUATION_POLARS_ARROW_DUCKDB.md    (mrrc-fks.10) ✅

Key Findings

Format Tiers Justified by ROI

Why Tier 1 + 2 (11 days) is sufficient:

  • ISO 2709 + Protobuf (4-5 days): Covers 100% of legacy + modern API users
  • Arrow + FlatBuffers + MessagePack (7 days): Solves distinct personas:
      • Arrow: Data scientists (analytics, DuckDB/Polars integration)
      • FlatBuffers: Mobile/embedded developers (64% memory savings, zero-copy)
      • MessagePack: REST API developers (25% file size, 50+ languages)
  • Tier 3 (on-demand): Niche verticals (Kafka, government archival) defer without blocking release

Why Specific Formats Were Excluded

Format              Decision      Rationale
Parquet             ❌ EXCLUDE    Redundant with Arrow; users export via Arrow IPC → external Parquet in ~3 lines of code (see sketch below)
Bincode             ❌ EXCLUDE    Fast serde format (~80% of MessagePack), but Rust-only; no cross-platform appeal
Ion                 ❌ EXCLUDE    Excellent flexibility but small ecosystem (6 languages); Protobuf superior for schema needs
JSON Lines          ⏸️ RESEARCH   Post-release; valuable for dev ergonomics but outside binary format scope
Custom MARC Schema  ⏸️ RESEARCH   Interesting for ISO 2709 evolution; requires community buy-in
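The "~3 lines of code" claim for the Parquet path is literal with pyarrow: read the Arrow IPC file mrrc produced, then hand the table to Parquet (file names below are illustrative).

```python
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Arrow IPC file written by mrrc → standalone Parquet file, no mrrc involvement.
table = ipc.open_file("records.arrow").read_all()
pq.write_table(table, "records.parquet")
```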

Implementation Roadmap (Format Support Strategy)

Phase 0 (Foundation): 1.5 days — Traits, module structure, test fixtures
Phase 1 (Core): 3-5 days — ISO 2709 refactor + Protobuf
Phase 2 (High-Value): 6-8 days — Arrow, FlatBuffers, MessagePack (parallelizable)
Phase 4 (Polish): 5-7 days — Python wrapper + documentation

Critical Path: 15-18 days wall time (Tier 1 + 2 complete)
MVP Option: 7-8 days (Tier 1 only; ship Tier 2 in v1.1 if needed)

See FORMAT_SUPPORT_STRATEGY.md Part 5 for detailed task breakdown.


Issue Tracker

Issue        Title                               Status
mrrc-fks     Binary Format Evaluation Epic       ✅ Closed
mrrc-fks.1   Protobuf Evaluation                 ✅ Complete
mrrc-fks.2   FlatBuffers Evaluation              ✅ Complete
mrrc-fks.3   Parquet Evaluation                  ✅ Complete
mrrc-fks.4   Avro Evaluation                     ✅ Complete
mrrc-fks.5   MessagePack Evaluation              ✅ Complete
mrrc-fks.6   CBOR Evaluation                     ✅ Complete
mrrc-fks.7   Arrow Evaluation                    ✅ Complete
mrrc-fks.8   Evaluation Framework                ✅ Complete
mrrc-fks.9   Format Strategy & Recommendations   ✅ Complete
mrrc-fks.10  Arrow Analytics (Polars/DuckDB)     ✅ Complete

Follow-Up Work (Not Blocking)

Identified during evaluation but deferred:

  • mrrc-fks.11: Streaming Arrow IPC evaluation (>100M records)
  • mrrc-fks.12: Protobuf/FlatBuffers schema evolution upgrade testing
  • mrrc-fks.13: Cross-language round-trip verification (Rust ↔ Python ↔ Java)
  • Optional: Bincode POC, JSON Lines benchmarking, Custom MARC Binary Schema research

How to Use This Research

For Release Planning: → Read FORMAT_SUPPORT_STRATEGY.md Part 10 (Final Recommendations)

For Implementation: → Use FORMAT_SUPPORT_STRATEGY.md Part 5 (Detailed Phase Breakdown)

For Performance Analysis: → Consult COMPARISON_MATRIX.md (aggregated metrics)

For Customer Questions: → Use COMPARISON_MATRIX.md customer personas (Part 7.2)

For Format-Specific Details: → Read individual evaluation documents (e.g., EVALUATION_PROTOBUF.md for Protobuf design choices)


Document Metadata

Item                       Value
Framework Version          2.0 (mrrc-fks.8, 2026-01-14)
Comparison Matrix Version  2.2 (mrrc-fks.9, 2026-01-19)
Strategy Version           1.0 (mrrc-fks.9 follow-up, 2026-01-19)
Total Evaluation Effort    120+ person-hours (8 formats evaluated)
Formats Evaluated          8 (9 with Arrow Analytics)
Formats Recommended        7 (Tier 1 + 2 + Analytics)
Edge Cases Tested          15 per format
Test Records               100+ per format
Baseline Throughput        903,560 rec/sec (ISO 2709)

Questions?

  • Implementation details? → See FORMAT_SUPPORT_STRATEGY.md Part 5 (Phase breakdown)
  • Performance comparisons? → See COMPARISON_MATRIX.md (performance metrics)
  • Why was format X excluded? → See FORMAT_SUPPORT_STRATEGY.md Part 1.3 (exclusions + rationale)
  • What's the timeline? → See FORMAT_SUPPORT_STRATEGY.md Part 10.3 (timeline + resources)
  • How do I add a new format? → See EVALUATION_FRAMEWORK.md + TEMPLATE_evaluation.md