Character Encoding¶

MRRC supports both MARC-8 (legacy) and UTF-8 character encodings, with automatic conversion.

Encoding Overview¶

Encoding	Leader Position 09	Description
MARC-8	(blank/space)	Legacy encoding with escape sequences for non-Latin scripts
UTF-8	`a`	Unicode, modern standard

MRRC handles encoding automatically:

Detects encoding from leader position 09
Converts MARC-8 to UTF-8 when reading
Stores all strings internally as UTF-8
Can write to either encoding

UTF-8 (Modern Standard)¶

UTF-8 is the recommended encoding for new records. It supports all Unicode characters directly without escape sequences.

Reading UTF-8 records:

PythonRust

from mrrc import MARCReader

for record in MARCReader("utf8_records.mrc"):
    # All strings are already UTF-8
    print(record.title())

use mrrc::MarcReader;

let mut reader = MarcReader::new(file);
while let Some(record) = reader.read_record()? {
    // All strings are Rust String (UTF-8)
    println!("{:?}", record.title());
}

MARC-8 (Legacy)¶

MARC-8 is a Library of Congress encoding that predates Unicode. It uses escape sequences to switch between character sets.

Supported Character Sets¶

MRRC supports all standard MARC-8 character sets:

Character Set	Code	Description
Basic Latin	42 (B)	ASCII characters
Extended Latin (ANSEL)	45 (E)	Diacritics and extended Latin
Basic Hebrew	32 (2)	Hebrew alphabet
Basic Arabic	33 (3)	Arabic script
Extended Arabic	34 (4)	Extended Arabic variants
Basic Cyrillic	4E (N)	Cyrillic alphabet
Extended Cyrillic	51 (Q)	Extended Cyrillic
Basic Greek	53 (S)	Greek alphabet
Subscript	62 (b)	Mathematical subscripts
Superscript	70 (p)	Mathematical superscripts
Greek Symbols	67 (g)	Greek letters in symbols
EACC	31 (1)	East Asian (CJK, 15,000+ characters)

Escape Sequences¶

MARC-8 uses escape sequences (starting with 0x1B) to switch character sets:

ESC + intermediate chars + final char → Switch character set

For example: - ESC ( B → Switch G0 to Basic Latin - ESC $ 1 → Switch G0 to EACC (East Asian)

You don't need to handle escape sequences manually - MRRC decodes them automatically.

Combining Marks (Diacritics)¶

MARC-8 represents diacritics as combining marks that precede their base character:

MARC-8:  [combining acute] + e → é
Unicode: e + [combining acute] → é (or precomposed é)

MRRC normalizes these to Unicode combining sequences.

Encoding Detection¶

Check a record's declared encoding via the leader:

PythonRust

from mrrc import MARCReader

for record in MARCReader("records.mrc"):
    # Check what encoding the record declares
    leader = record.leader()
    if leader.character_coding == 'a':
        print("Record declares UTF-8")
    else:
        print("Record declares MARC-8")

use mrrc::{MarcReader, encoding::MarcEncoding};

let encoding = MarcEncoding::from_leader_char(leader.position_9())?;
match encoding {
    MarcEncoding::Utf8 => println!("UTF-8"),
    MarcEncoding::Marc8 => println!("MARC-8"),
}

Writing with Specific Encoding¶

By default, MRRC writes UTF-8. To write MARC-8:

PythonRust

from mrrc import MARCWriter, Encoding

with MARCWriter("output.mrc", encoding=Encoding.MARC8) as writer:
    writer.write(record)

use mrrc::{MarcWriter, encoding::MarcEncoding};

let mut writer = MarcWriter::with_encoding(
    File::create("output.mrc")?,
    MarcEncoding::Marc8
);
writer.write_record(&record)?;

Mixed Encoding Handling¶

Some legacy records have inconsistent encoding - the leader says MARC-8 but some fields contain UTF-8 (or vice versa).

In Rust, the encoding validator can detect this programmatically:

use mrrc::encoding::EncodingValidator;

let analysis = EncodingValidator::analyze_encoding(&record)?;
match analysis {
    EncodingAnalysis::Consistent(enc) => {
        println!("Consistent encoding: {:?}", enc);
    }
    EncodingAnalysis::Mixed { primary, .. } => {
        println!("Warning: mixed encoding detected");
    }
    EncodingAnalysis::Undetermined => {
        println!("Could not determine encoding");
    }
}

In Python, MRRC handles encoding conversion automatically when reading records. If you encounter encoding issues, check the leader's character_coding property and compare it with the actual content.

Common Issues¶

Mojibake (Garbled Text)¶

Record declares UTF-8 but contains MARC-8
Record declares MARC-8 but contains UTF-8
File was saved with wrong encoding

Solution: Check the leader position 9 and verify it matches the actual data.

Missing Characters¶

If characters display as ? or \uFFFD:

The character may not be in the MARC-8 character tables
The character may be from an unsupported script
The data may be corrupted

East Asian Text (CJK)¶

MARC-8 uses EACC (East Asian Character Code) for Chinese, Japanese, and Korean:

Uses 3-byte sequences (escape + 2 bytes)
MRRC supports 15,000+ EACC characters
Modern records should use UTF-8 for CJK

Best Practices¶

Use UTF-8 for new records - Simpler, universal character support
Preserve original encoding when round-tripping - Read MARC-8, write MARC-8 if you need exact byte preservation
Validate encoding before batch processing - Check a sample of records for consistency
Handle encoding errors gracefully - Some legacy records have encoding issues

Character Encoding¶

Encoding Overview¶

UTF-8 (Modern Standard)¶

MARC-8 (Legacy)¶

Supported Character Sets¶

Escape Sequences¶

Combining Marks (Diacritics)¶

Encoding Detection¶

Writing with Specific Encoding¶

Mixed Encoding Handling¶

Common Issues¶

Mojibake (Garbled Text)¶

Missing Characters¶

East Asian Text (CJK)¶

Best Practices¶

See Also¶