How to replace accented characters with their respective non-accented counterparts

… with Unicode and Perl 5.8. Last update: 2004-10-24

Let’s assume you have some ISO-8859-1 (Latin-1) data. Naturally, this data has some accented characters, such as “é”. You need to convert this data to plain ASCII but still keep it readable. Here is some advice on how to do that.

First, it will be useful to understand certain things about Unicode. Then I’ll show you what to do, exactly.

Virtues of Unicode

Unicode is, first of all, a database of characters. Its primary virtue is that it assigns codes to almost all existing letters and punctuation characters of almost all of the world’s languages. It also gives each character a name.

Selected Unicode character codepoints and their names
codepoint	character name	the character
U+005D	RIGHT SQUARE BRACKET	]
U+00E3	LATIN SMALL LETTER A WITH TILDE	ã
U+0F3D	TIBETAN MARK ANG KHANG GYAS	༽
U+1FBF	GREEK PSILI	᾿
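
The names in the table can be looked up from Perl itself. Here is a minimal sketch, assuming the core charnames module and its viacode function; the helper name char_name is made up for illustration only:

```perl
use strict;
use warnings;
use charnames ();   # load the name tables without importing anything

# Return the official Unicode name for a code point
# (char_name is a hypothetical helper, not part of any module).
sub char_name {
    my ($codepoint) = @_;
    return charnames::viacode($codepoint);
}

# Print the code points from the table together with their names.
for my $cp ( 0x005D, 0x00E3, 0x0F3D, 0x1FBF ) {
    printf "U+%04X\t%s\n", $cp, char_name($cp);
}
```

The output should repeat the first two columns of the table above.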

But that is not even half of the story. Unicode also carries a hell of a lot of other information about the characters, some of which is very useful at times.

Combining sequence

The same human text in the same encoding may have several Unicode representations. To be precise, there might be several ways to write certain characters in Unicode. The funny thing is: the accented characters are such characters. And I’m not talking about encodings here.

Let’s make this clear. Unicode has separate characters for special marks, which can be combined with other characters to form a new one. For instance, there is a separate character for the acute accent mark. When you need to write “é”, you can write “e” (U+0065) and add the combining acute mark (U+0301) to it. That is, you put two Unicode characters one after another: U+0065 U+0301. If I understand it right, this is an example of what is called a “combining sequence”.

Or you can write the Unicode character U+00E9 directly, which is LATIN SMALL LETTER E WITH ACUTE. Both ways are equivalent in terms of the text produced, and every Unicode-compatible application should process (e.g. display) them equally well.

Not all possible combining sequences have single-character equivalents. But all the accented characters used in the ISO-8859-1 encoding do have both representations.
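
To see the two representations side by side, here is a small sketch. It assumes the Unicode::Normalize module and its NFC and NFD functions; the variable names are mine:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $combined    = "e\x{0301}";  # "e" followed by COMBINING ACUTE ACCENT
my $precomposed = "\x{00E9}";   # LATIN SMALL LETTER E WITH ACUTE

# As raw strings the two differ in length and content ...
printf "lengths: %d and %d\n", length($combined), length($precomposed);
printf "raw strings equal: %s\n", ( $combined eq $precomposed ? "yes" : "no" );

# ... but they agree once both are put into the same normalization form.
printf "equal after NFC: %s\n", ( NFC($combined) eq NFC($precomposed) ? "yes" : "no" );
printf "equal after NFD: %s\n", ( NFD($combined) eq NFD($precomposed) ? "yes" : "no" );
```

A plain string comparison treats the two spellings as different, which is why normalizing before comparing or stripping matters.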

Unicode decomposition

Unicode defines a precise way to translate combining sequences (U+0065 U+0301) into their single-character equivalents (U+00E9), which is composition, and it defines the reverse translation as well. Translating single accented characters into the corresponding combining sequences is decomposition. That’s the next big thing for our task.

Decomposition would break each “é” into “e” and the acute accent mark “´”.
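
A quick way to watch decomposition happen is to print the code points before and after. This is only a sketch, again assuming Unicode::Normalize; codepoints is a hypothetical helper name:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Format a string as a list of code points, e.g. "U+0065 U+0301"
# (codepoints is a hypothetical helper, not a library function).
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $text;
}

my $e_acute = "\x{00E9}";

print codepoints($e_acute), "\n";                  # the single character
print codepoints( NFD($e_acute) ), "\n";           # decomposed: base letter + mark
print codepoints( NFC( NFD($e_acute) ) ), "\n";    # composed back again
```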

Outline

So, here is the outline of our solution:

we take some data with accented characters;
convert it to Unicode;
put it through Canonical Decomposition, also known as Normalization Form D;
remove all characters that belong to the Unicode General Category “Mark” (non-spacing, spacing combining, enclosing) — thus removing the accent marks;
prepare the data for output to an ASCII stream.

I had to tell you the whole story about Unicode, because otherwise you wouldn’t understand this outline. If something is still unclear, ask me questions so that I can improve the text above.

Practice

We will need Perl 5.8 with its Encode module, and the Unicode::Normalize module, which is bundled with Perl 5.8 and is also available from CPAN.

Unicode::Normalize search on search.cpan.org

The code:

 require Encode;
 use Unicode::Normalize;

 for ( $str ) {
   ##  convert to Unicode, if your data is originally in
   ##  Latin-1:
   $_ = Encode::decode( 'iso-8859-1', $_ ); 
   $_ = NFD( $_ );   ##  decompose
   s/\pM//g;         ##  strip combining characters
   s/[^\0-\x7f]//g;  ##  clear everything else (anything outside ASCII)
 }
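
To try these steps on real input, one option is to wrap them in a sub. The following is a sketch only: the name latin1_to_ascii and the sample string are my own, and the last substitution uses \x7f as the upper bound so that nothing outside ASCII survives.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFD);

# Hypothetical wrapper: takes Latin-1 bytes, returns plain ASCII.
sub latin1_to_ascii {
    my ($bytes) = @_;
    my $text = Encode::decode( 'iso-8859-1', $bytes );  # bytes -> characters
    $text = NFD($text);          # decompose accented characters
    $text =~ s/\pM//g;           # strip the combining marks
    $text =~ s/[^\0-\x7f]//g;    # drop whatever is still non-ASCII
    return $text;
}

# "café naïve façade" as Latin-1 bytes
print latin1_to_ascii("caf\xe9 na\xefve fa\xe7ade"), "\n";   # prints "cafe naive facade"
```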

Nothing is perfect

Problem: not all “funny” characters in ISO-8859-1 and its relatives are decomposable into a base character and a combining sequence. Here are some of those: “ß”, “Ø”, “œ”. These characters would disappear after the above code, unless we take measures.

In fact, there are a lot of such characters. Some of them transliterate well into a pair, like “æ” -> “ae”; the same treatment suits a few decomposable characters too, like the German “ä” -> “ae”. Others map to a single simple character, like “Ø” -> “O”.
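
The disappearing characters are easy to reproduce. A minimal sketch, assuming Unicode::Normalize; the sample word is my own:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Sharp s, O with stroke and the oe ligature have no canonical decomposition,
# so NFD hands them back unchanged.
for my $char ( "\x{00DF}", "\x{00D8}", "\x{0153}" ) {
    printf "U+%04X unchanged by NFD: %s\n",
        ord($char), ( NFD($char) eq $char ? "yes" : "no" );
}

# Running the naive steps on such text silently loses a letter.
my $word = "Stra\x{00DF}e";               # "Strasse" written with a sharp s
( my $ascii = NFD($word) ) =~ s/\pM//g;   # decompose, strip marks
$ascii =~ s/[^\0-\x7f]//g;                # drop the remaining non-ASCII
print "$ascii\n";                         # prints "Strae"
```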

Here is my code complete with all those additional transliterations:

 require Encode;
 use Unicode::Normalize;

 for ( $str ) {
   ##  convert to Unicode first
   ##  if your data comes in Latin-1, then uncomment:
   #$_ = Encode::decode( 'iso-8859-1', $_ );  

   s/\x{00e4}/ae/g;  ##  treat characters ä ñ ö ü ÿ
   s/\x{00f1}/ny/g;  
   s/\x{00f6}/oe/g;
   s/\x{00fc}/ue/g;
   s/\x{00ff}/yu/g;

   $_ = NFD( $_ );   ##  decompose (Unicode Normalization Form D)
   s/\pM//g;         ##  strip combining characters

   s/\x{00df}/ss/g;  ##  German sharp s “ß” -> “ss”
   s/\x{00c6}/AE/g;  ##  Æ
   s/\x{00e6}/ae/g;  ##  æ
   s/\x{0132}/IJ/g;  ##  Ĳ
   s/\x{0133}/ij/g;  ##  ĳ
   s/\x{0152}/Oe/g;  ##  Œ
   s/\x{0153}/oe/g;  ##  œ

   tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
   tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
   tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊŉŋØøſ
   tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/;                   # ÞŦþŧ

   s/[^\0-\x7f]//g;  ##  clear everything else (anything outside ASCII)
 }
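
For a quick sanity check of the approach, here is a condensed sketch. The sub name to_ascii and the sample words are hypothetical, and only a handful of the mappings from the listing above are repeated.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: expects a character string that is already decoded.
sub to_ascii {
    my ($text) = @_;

    # German-style digraphs have to be applied before decomposition
    $text =~ s/\x{00e4}/ae/g;   # a with diaeresis
    $text =~ s/\x{00f6}/oe/g;   # o with diaeresis
    $text =~ s/\x{00fc}/ue/g;   # u with diaeresis

    $text = NFD($text);         # decompose
    $text =~ s/\pM//g;          # strip combining marks

    # characters that have no decomposition
    $text =~ s/\x{00df}/ss/g;   # sharp s
    $text =~ s/\x{00e6}/ae/g;   # ae ligature
    $text =~ tr/\x{00d8}\x{00f8}\x{0141}\x{0142}/OoLl/;   # O/o with stroke, L/l with stroke

    $text =~ s/[^\0-\x7f]//g;   # clear everything else
    return $text;
}

# Mueller, Strasse, Soren, Lodz
for my $word ( "M\x{00fc}ller", "Stra\x{00df}e", "S\x{00f8}ren", "\x{0141}\x{00f3}d\x{017a}" ) {
    print to_ascii($word), "\n";
}
```

The order matters: the digraph substitutions run first because after decomposition “ü” is just “u” plus a mark, and the mark would be stripped.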

When dealing with European data this will give really good results. The code is in the public domain. Don’t forget it requires Perl 5.8. To compile the above special character translations I used that page. If something is wrong or incomplete, please let me know.

P.S.

Having said all the above, I suggest returning to the original problem you had and thinking about it again. Is this the right way?

These days basic support for Unicode is not a miracle anymore. Operating systems, web browsers, fonts — many are already Unicode-enabled. Potentially, this could mean that transformation of data to plain ASCII is an unnecessary loss of information. Information of cultural value, by the way.
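
If the consumer of your data can handle Unicode, the lossless alternative is to re-encode rather than strip. A minimal sketch, assuming the Encode module and UTF-8 as the target encoding (both the target and the variable names are my choice):

```perl
use strict;
use warnings;
use Encode ();

my $latin1 = "caf\xe9";                                  # Latin-1 bytes
my $text   = Encode::decode( 'iso-8859-1', $latin1 );    # character string
my $utf8   = Encode::encode( 'UTF-8', $text );           # UTF-8 bytes, accent kept

printf "%d Latin-1 bytes became %d UTF-8 bytes\n", length($latin1), length($utf8);
```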