Boneyard Tools

Diacritics, Unicode normalization and transliteration

How NFD decomposition makes accents removable, why some letters need a manual map, and where transliteration to ASCII helps.

What a diacritic really is

A diacritic is a small mark added to a base letter to change its sound, such as the acute on the e in café or the umlaut on the u in Zürich. On screen the accented letter looks like a single glyph, but Unicode can represent it two ways: as one precomposed code point, or as a base letter followed by a separate combining mark. That dual nature is the key to removing accents cleanly, because once a letter is split apart the mark can simply be deleted.
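The two representations can be inspected directly. The following is a minimal Python sketch using the standard `unicodedata` module; the variable names are illustrative only and are not taken from the tool.

```python
import unicodedata

# The same visible letter, stored two different ways.
composed = "\u00e9"      # one precomposed code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # base letter "e" followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))   # number of code points in each form
print(composed == decomposed)           # raw comparison sees two different strings
print(unicodedata.name(decomposed[1]))  # official name of the separate mark

# After normalizing both to the same form they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Because the two strings render identically but differ in length and content, any comparison or accent removal has to normalize first.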

NFD and NFC normalization

Unicode defines normalization forms that convert between the composed and decomposed views. NFD, the decomposition form, rewrites an accented letter as its base plus one or more combining marks; for Latin text these almost always come from the Combining Diacritical Marks block, U+0300 to U+036F. This tool applies NFD, deletes every character in that combining range, then applies NFC to recompose whatever is left. The result keeps the plain base letters and drops only the accents, which is far safer than trying to match thousands of precomposed characters by hand.
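The three steps described above (NFD, delete U+0300 to U+036F, NFC) can be sketched in a few lines of Python. This is an illustration of the technique under that description, not the tool's actual source; the name `strip_accents` is hypothetical.

```python
import re
import unicodedata

# Combining Diacritical Marks block, the range named in the text.
COMBINING_MARKS = re.compile("[\u0300-\u036f]")

def strip_accents(text: str) -> str:
    """Remove accents by decomposing, deleting combining marks, recomposing."""
    decomposed = unicodedata.normalize("NFD", text)     # split base letter and marks
    without_marks = COMBINING_MARKS.sub("", decomposed)  # drop only the marks
    return unicodedata.normalize("NFC", without_marks)   # recompose whatever is left

print(strip_accents("café"))
print(strip_accents("Zürich"))
```

Letters with no decomposition, such as the German sharp s, pass through this function unchanged.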

Letters that will not decompose

Normalization does not solve everything, because some letters are not an accented form of an ASCII letter at all. The German sharp s (ß), the ae and oe ligatures (æ, œ), thorn and eth from Old English and Icelandic (þ, ð), and stroked letters like o-slash (ø) and l-stroke (ł) have no canonical decomposition, so there is no base letter to fall back to. NFD leaves them intact, so the tool carries an explicit map that transliterates each one to a sensible ASCII string, turning the sharp s into ss and thorn into th rather than dropping them.
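A map of this kind might look like the sketch below. Only the ss and th replacements come from the text above; every other entry, along with the names `SPECIAL_LETTERS` and `transliterate_special`, is an assumption chosen for illustration, and the tool's real table may differ.

```python
import unicodedata

# Hypothetical transliteration map. Only ß -> ss and þ -> th are stated in
# the article; the remaining entries are common conventions, assumed here.
SPECIAL_LETTERS = {
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
    "œ": "oe", "Œ": "OE",
    "þ": "th", "Þ": "Th",
    "ð": "d",  "Ð": "D",
    "ø": "o",  "Ø": "O",
    "ł": "l",  "Ł": "L",
}
TABLE = str.maketrans(SPECIAL_LETTERS)

def transliterate_special(text: str) -> str:
    """Replace letters that NFD cannot decompose with ASCII stand-ins."""
    return text.translate(TABLE)

# NFD leaves every one of these letters untouched, which is why a map is needed.
print(all(unicodedata.normalize("NFD", ch) == ch for ch in SPECIAL_LETTERS))
print(transliterate_special("Straße"))
```

This step only handles the special letters; ordinary accents still need the NFD pass.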

Where transliteration to ASCII helps

Plain ASCII is still the safest currency for machine-readable text. URL slugs, filenames, email local parts, sort keys and legacy database columns can all misbehave when fed accented characters, which are sometimes double-encoded into mojibake. Stripping accents first gives you a clean, predictable string that you can then lowercase and slugify. It also improves matching, so a search for Munchen can find München once both sides are folded to ASCII.
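As a rough sketch of both uses, the Python below folds text to ASCII, then builds a slug and compares two spellings. The functions `fold_to_ascii` and `slugify`, the abbreviated letter map, and the slug rule (runs of anything other than a to z and 0 to 9 become a hyphen) are all assumptions made for this example.

```python
import re
import unicodedata

# Abbreviated, hypothetical map for letters that NFD cannot decompose.
SPECIAL = str.maketrans({
    "ß": "ss", "æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE",
    "þ": "th", "Þ": "Th", "ø": "o", "Ø": "O", "ł": "l", "Ł": "L",
})

def fold_to_ascii(text: str) -> str:
    """Strip combining marks via NFD, recompose, then map special letters."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = re.sub("[\u0300-\u036f]", "", decomposed)
    return unicodedata.normalize("NFC", without_marks).translate(SPECIAL)

def slugify(text: str) -> str:
    """Fold first, then lowercase and join alphanumeric runs with hyphens."""
    folded = fold_to_ascii(text).lower()
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")

print(slugify("Crème Brûlée à la Maison"))
# Matching: fold both the query and the stored value before comparing.
print(fold_to_ascii("München").lower() == fold_to_ascii("Munchen").lower())
```

Folding both the query and the stored text with the same function is what makes the comparison symmetric.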

Frequently asked questions

Is removing accents the same as normalizing to NFC?

No. NFC only reshapes how a character is stored while keeping the accent. Removing accents goes further: it decomposes with NFD, deletes the combining marks, and transliterates special letters, so the accent is gone rather than merely recomposed.

Why not just strip every non-ASCII byte?

Because that deletes information. Blindly dropping non-ASCII bytes turns café into caf and loses the sharp s (ß) entirely. Decomposing first keeps the base letter, and the transliteration map preserves letters that have no direct ASCII form.
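The difference is easy to see side by side. In this hedged Python sketch, `drop_non_ascii` is the blunt approach and `fold` is a cut-down version of the decompose-then-map approach with a one-entry map; both names are hypothetical.

```python
import unicodedata

def drop_non_ascii(text: str) -> str:
    """Blunt approach: discard every character outside ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

def fold(text: str) -> str:
    """Decompose, delete marks in U+0300..U+036F, recompose, map the sharp s."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f")
    )
    return unicodedata.normalize("NFC", without_marks).replace("ß", "ss")

for word in ("caf\u00e9", "Stra\u00dfe"):   # precomposed é and sharp s
    print(word, "->", drop_non_ascii(word), "vs", fold(word))
```

The blunt version shortens both words, while the folding version keeps every letter in some ASCII form.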