MARC Character Encoding Problems: MARC-8, UTF-8, and Broken Accents

Understand MARC character encoding problems, including MARC-8 vs UTF-8, broken accents, and how to test records before importing into Koha.

May 19, 2026 4 min read

Character encoding problems are among the most frustrating issues in MARC migration work. A file may import into Koha, but titles, names, subjects, or notes may display with broken accents or strange characters.

This often happens when the system reading the file interprets the text using the wrong encoding.

Before importing a full catalogue, check encoding carefully.

What is character encoding?

Character encoding tells a computer how to store and display text.

In MARC workflows, two common encodings are:

MARC-8, an older character encoding used in many legacy MARC records.
UTF-8, a modern Unicode encoding widely used by current systems.

A record may contain the right data but still display incorrectly if the software reading it decodes the bytes using the wrong encoding.
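The effect is easy to reproduce. The sketch below decodes the same bytes twice, once correctly and once with the wrong encoding. Latin-1 stands in for "the wrong encoding" only because Python's standard library has no MARC-8 codec; the name used is an illustrative value, not data from a real catalogue.

```python
# The same bytes, decoded two ways.
# Assumption: Latin-1 is used as the "wrong" encoding for illustration only.
original = "Fauré, Gabriel"
raw_bytes = original.encode("utf-8")

decoded_correctly = raw_bytes.decode("utf-8")
decoded_wrongly = raw_bytes.decode("latin-1")

print(decoded_correctly)  # Fauré, Gabriel
print(decoded_wrongly)    # FaurÃ©, Gabriel
```

The stored bytes never changed. Only the interpretation did, which is why the same file can look right in one tool and wrong in another.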

What broken encoding looks like

Encoding problems may appear as:

replacement characters (U+FFFD, usually shown as a question mark inside a diamond);
question marks where accents should be;
names with broken diacritics;
stray symbols appearing in titles;
non-Latin scripts not displaying correctly;
text that looks right in one tool and wrong in another.
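Some of these symptoms can be spotted automatically once field values have been extracted as text. The following is a minimal heuristic sketch, not a complete detector: the function name `encoding_warnings` and the three checks are assumptions made for illustration, and both false positives and misses are expected.

```python
import re

REPLACEMENT_CHAR = "\ufffd"

# "Â" or "Ã" followed by a character in U+0080 to U+00BF is a common trace of
# UTF-8 bytes that were decoded as Latin-1. This is a heuristic, not proof.
MOJIBAKE_PATTERN = re.compile("[\u00c2\u00c3][\u0080-\u00bf]")

# A question mark with a letter on both sides often marks a lost accent.
EMBEDDED_QUESTION_MARK = re.compile(r"\w\?\w")


def encoding_warnings(value: str) -> list[str]:
    """Return human-readable warnings for one field value (hypothetical helper)."""
    warnings = []
    if REPLACEMENT_CHAR in value:
        warnings.append("replacement character")
    if MOJIBAKE_PATTERN.search(value):
        warnings.append("possible UTF-8 read as Latin-1")
    if EMBEDDED_QUESTION_MARK.search(value):
        warnings.append("question mark inside a word")
    return warnings
```

A clean value returns an empty list, so the function can be run across every 100, 245, 6XX, and 7XX value in a sample and only the flagged ones reviewed by hand.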

Examples of affected fields include:

100 author names;
245 titles;
260/264 publication data;
500 notes;
600/650 subjects;
700 added entries.

Why encoding matters in Koha

Encoding problems can affect more than display. They may also affect:

search;
sorting;
authority matching;
duplicate detection;
patron trust in the catalogue;
staff confidence after migration.

If a patron searches for a title or author with accented characters, broken encoding may make the record harder to find. For example, a search for "Fauré" is unlikely to match a heading that was stored as "FaurÃ©".

Common causes

1. MARC-8 file treated as UTF-8

This can happen when an import setting assumes UTF-8 but the source file is actually MARC-8.

2. UTF-8 file treated as MARC-8

The reverse can also happen. The data is modern UTF-8, but the import process or conversion tool assumes MARC-8.

3. Mixed encoding in one project

A migration project may contain records from multiple export sources. Some batches may be UTF-8, while others are MARC-8 or partially corrupted.
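The first three causes can often be caught per record. In MARC 21, leader position 09 declares the character coding scheme: a blank means MARC-8 and `a` means UCS/Unicode. The sketch below compares that declaration with what the bytes look like. It assumes each record is available as raw ISO 2709 bytes; the function name `classify_record` and its return labels are hypothetical, and the UTF-8 validity test is a heuristic rather than a guarantee.

```python
def classify_record(record: bytes) -> str:
    """Compare leader/09 with the actual bytes of one binary MARC record."""
    declared = record[9:10]  # leader position 09, character coding scheme

    try:
        record.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    # Pure ASCII records decode the same way under either scheme.
    has_high_bytes = any(byte >= 0x80 for byte in record)

    if declared == b"a":
        if valid_utf8:
            return "utf-8"
        return "declared UTF-8 but bytes are not valid UTF-8"
    if declared == b" ":
        if has_high_bytes and valid_utf8:
            return "declared MARC-8 but bytes look like UTF-8"
        return "marc-8"
    return "unknown coding scheme"
```

Counting the labels across each export batch gives a quick picture of whether a project mixes encodings, and which batches need a different import setting.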

4. Bad conversions from old systems

Older systems sometimes export records that already contain corrupted characters. In that case, conversion alone may not fully fix the problem.

5. Copy-and-paste artefacts

Records created or edited manually may contain smart quotes, tabs, non-breaking spaces, or other characters that cause later problems.
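Characters like these are valid Unicode, so they will not trigger a decoding error and have to be looked for explicitly. A minimal sketch, assuming field values are already available as text; the character list and the name `find_paste_artefacts` are illustrative choices, and whether to replace smart quotes is a local policy decision.

```python
# Characters that commonly arrive through copy and paste (illustrative list).
SUSPECT_CHARACTERS = {
    "\u2018": "left single smart quote",
    "\u2019": "right single smart quote",
    "\u201c": "left double smart quote",
    "\u201d": "right double smart quote",
    "\u00a0": "non-breaking space",
    "\t": "tab",
}


def find_paste_artefacts(value: str) -> list[tuple[int, str]]:
    """Return (position, description) for each suspect character in a value."""
    return [
        (index, SUSPECT_CHARACTERS[char])
        for index, char in enumerate(value)
        if char in SUSPECT_CHARACTERS
    ]
```

Reporting positions rather than silently replacing characters keeps the decision with the cataloguer.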

How to test encoding before full import

Use a sample that includes:

accented names;
non-English titles;
apostrophes and quotation marks;
symbols;
records from different parts of the export;
older records and newer records;
vendor-supplied records if included.
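To cover different parts of the export, the sample can be drawn at even intervals rather than from the top of the file. The sketch below assumes a binary ISO 2709 file in which each record ends with the record terminator byte 0x1D. The helper names are hypothetical, and a MARC-aware library that reads record lengths from the leader would be more robust on damaged files.

```python
RECORD_TERMINATOR = b"\x1d"


def split_records(data: bytes) -> list[bytes]:
    """Split a binary MARC file into records, keeping each terminator."""
    return [
        chunk + RECORD_TERMINATOR
        for chunk in data.split(RECORD_TERMINATOR)
        if chunk.strip()  # skip the empty tail after the last terminator
    ]


def pick_sample(records: list[bytes], size: int) -> list[bytes]:
    """Pick `size` records spaced evenly from the start to the end of the export."""
    if size <= 0 or not records:
        return []
    if size >= len(records):
        return list(records)
    step = len(records) / size
    return [records[int(index * step)] for index in range(size)]
```

An evenly spaced sample will not guarantee accented names or vendor records, so it is worth adding a few hand-picked records of each kind listed above.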

Then:

1. Open the sample in a MARC-aware tool.
2. Upload it to MARCReady.
3. Review any encoding warnings.
4. Export a repaired sample.
5. Stage the sample in Koha.
6. Check staff and OPAC display.
7. Search for accented names and titles.
8. Confirm the display is acceptable before importing the full file.

How MARCReady handles encoding

MARCReady detects the file encoding on upload and normalises subfield values to Unicode NFC. MARC-8 binary files are converted to UTF-8 during parsing, and all output is exported as UTF-8.
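NFC normalisation matters because an accented letter can be stored either as one precomposed character or as a base letter followed by a combining mark, and the two forms do not compare as equal. The sketch below illustrates NFC itself using Python's standard `unicodedata` module. It is not MARCReady's implementation, and the sample value is made up for illustration.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "Faure\u0301"

# NFC composes the pair into the single character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 6 5
print(decomposed == "Faur\u00e9")      # False
print(composed == "Faur\u00e9")        # True
```

Both forms usually look identical on screen, which is why unnormalised data can cause search and duplicate-detection mismatches that are hard to see by eye.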

Encoding artefacts — such as common broken accent patterns or unusual character sequences — are identified in the review log. Records that may need manual inspection are flagged.

MARCReady is not magic. If a source export has already destroyed the original characters, human review may still be required. But it can help identify problems early and avoid importing a full catalogue with broken text.