… with Unicode and Perl 5.8. Last update: 2004-10-24
Let’s assume you have some ISO-8859-1 (Latin-1) data. Naturally, this data has some accented characters, for instance “é”. You need to convert this data to plain ASCII, but still keep it readable. I’ll give you some advice on how to do that.
First, it will be useful to understand certain things about Unicode. Then I’ll show you what to do, exactly.
Unicode, first of all, is a database of characters. Its primary virtue is that it assigns codes to almost all existing letters and punctuation characters of almost all of the world’s languages. It also gives each character a name.
codepoint | character name | the character |
---|---|---|
U+005D | RIGHT SQUARE BRACKET | ] |
U+00E3 | LATIN SMALL LETTER A WITH TILDE | ã |
U+0F3D | TIBETAN MARK ANG KHANG GYAS | ༽ |
U+1FBF | GREEK PSILI | ᾿ |
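If you want to look at that database from Perl, here is a small sketch. It assumes the standard `charnames` pragma and its `charnames::viacode` function; the helper name `describe_codepoint` is my own invention for this example.

```perl
use strict;
use warnings;
use charnames ':full';   # gives access to the Unicode character names

# Hypothetical helper: format a code point roughly like a row of the table above.
sub describe_codepoint {
    my ($code) = @_;
    my $name = charnames::viacode($code);
    $name = '(no name)' unless defined $name;   # unassigned code points have no name
    return sprintf 'U+%04X | %s', $code, $name;
}

print describe_codepoint($_), "\n" for 0x005D, 0x00E3, 0x0F3D, 0x1FBF;
```

Only code points and names are printed, so the output stays plain ASCII and no output encoding has to be set up.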
But that is not even half of the story. Unicode also carries a hell of a lot of other information about the characters, some of which is occasionally very useful.
The same human text in the same encoding may have several Unicode representations. To be precise, there might be several ways to write certain characters in Unicode. The funny thing is: the accented characters are such characters. And I’m not talking about encodings here.
Let’s make this clear. Unicode has separate characters for special marks, which can be combined with other characters to form a new one. For instance, there is a separate character for the acute accent mark. When you need to write “é”, you can write “e” (U+0065) and add the combining acute mark to it (U+0301). This means: put two Unicode characters, one after another: U+0065 U+0301. If I understand it right, this is an example of what is called a “combining sequence”.
Or, you can write the Unicode character U+00E9 directly, which is LATIN SMALL LETTER E WITH ACUTE. Both ways are equivalent in terms of the text produced, and every Unicode-compatible application should process (e.g. display) them equally well.
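To see the two spellings side by side, here is a minimal sketch in plain Perl. The helper `codepoints` is a name I made up for this example; it only lists the code points of a string.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points of a string as "U+XXXX" tokens.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

my $precomposed = "\x{00E9}";          # LATIN SMALL LETTER E WITH ACUTE
my $combining   = "\x{0065}\x{0301}";  # "e" followed by COMBINING ACUTE ACCENT

print codepoints($precomposed), "\n";   # U+00E9
print codepoints($combining),   "\n";   # U+0065 U+0301

# To Perl these are two different strings, even though a reader sees the same letter.
print "same string: ", ($precomposed eq $combining ? 'yes' : 'no'), "\n";   # no
```

So a plain `eq` comparison is not enough to tell whether two pieces of text are "the same" in the Unicode sense.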
Not all possible combining sequences have single-character equivalents. But all the accented characters used in the ISO-8859-1 encoding do have both representations.
Unicode defines a precise way to translate between combining sequences (U+0065 U+0301) and their single character equivalents (U+00E9). Unicode defines backwards translation as well. Translating single accented characters into corresponding combining sequences is decomposition. That’s the next big thing for our task.
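Both directions can be tried out with a short sketch. It assumes the `Unicode::Normalize` module, whose `NFD` function decomposes and whose `NFC` function composes; `codepoints` is again just a helper name of my own.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: list the code points of a string as "U+XXXX" tokens.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

# Decomposition: single character -> base character + combining mark.
print codepoints( NFD("\x{00E9}") ), "\n";           # U+0065 U+0301

# Composition: the backwards translation.
print codepoints( NFC("\x{0065}\x{0301}") ), "\n";   # U+00E9
```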
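Decomposition would break each “é” into “e” and the acute accent mark “´”. So, here is the outline of our solution: convert the data to Unicode, decompose it, strip the combining marks, and finally drop whatever non-ASCII characters remain.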
I had to tell you the whole story about Unicode, because otherwise you wouldn’t understand this outline. If something is still unclear, ask me questions, so that I can improve the text above.
We will need Perl 5.8 with its Encode module, and the Unicode::Normalize module, which is bundled with Perl 5.8 and is also available from CPAN.
The code:
```perl
require Encode;
use Unicode::Normalize;

for ( $str ) {
    ## convert to Unicode, if your data is originally in Latin-1:
    $_ = Encode::decode( 'iso-8859-1', $_ );
    $_ = NFD( $_ );      ## decompose
    s/\pM//g;            ## strip combining characters
    s/[^\0-\x7f]//g;     ## clear everything else (anything outside ASCII)
}
```
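For a quick try-out, the same steps can be wrapped in a function. This is only a sketch: the name `latin1_to_ascii` and the sample string are mine, and the input is assumed to be a string of Latin-1 bytes.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFD);

# Hypothetical wrapper around the steps shown above.
sub latin1_to_ascii {
    my ($bytes) = @_;
    my $text = Encode::decode( 'iso-8859-1', $bytes );  # Latin-1 bytes -> Unicode
    $text = NFD($text);                                 # decompose
    $text =~ s/\pM//g;                                  # strip combining marks
    $text =~ s/[^\x00-\x7f]//g;                         # drop remaining non-ASCII
    return $text;
}

# "Café crème, naïve façade" written as Latin-1 bytes
print latin1_to_ascii("Caf\xE9 cr\xE8me, na\xEFve fa\xE7ade"), "\n";
# prints: Cafe creme, naive facade
```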
Problem: not all “funny” characters of ISO-8859-1 are decomposable into a base character and combining marks. Here are some of those: “ß”, “Ø”, and, from outside Latin-1 proper, “œ”. These characters would simply disappear after the above code, unless we take measures.
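You can check which characters are affected with a small sketch like the following. It assumes `Unicode::Normalize`; the function name `has_canonical_decomposition` is made up for this example. A character that comes out of `NFD` unchanged has nothing to strip, so the final clean-up step deletes it whole.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if NFD changes the character, that is,
# if the character splits into a base character plus combining marks.
sub has_canonical_decomposition {
    my ($code) = @_;
    my $char = chr $code;
    return NFD($char) ne $char;
}

# U+00E9 is "é"; the others are "ß", "Ø" and "œ".
for my $code ( 0x00E9, 0x00DF, 0x00D8, 0x0153 ) {
    printf "U+%04X decomposable: %s\n",
        $code, ( has_canonical_decomposition($code) ? 'yes' : 'no' );
}
# prints "yes" for U+00E9 and "no" for the other three
```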
In fact, there are a lot of such characters. Some of them transliterate well into a pair, like “ä” -> “ae”. Others transliterate into a single simple character, like “Ø” -> “O”.
Here is my code complete with all those additional transliterations:
```perl
require Encode;
use Unicode::Normalize;

for ( $str ) {
    ## convert to Unicode first
    ## if your data comes in Latin-1, then uncomment:
    #$_ = Encode::decode( 'iso-8859-1', $_ );

    ## treat characters ä ñ ö ü ÿ (must happen before decomposition)
    s/\x{00e4}/ae/g;
    s/\x{00f1}/ny/g;
    s/\x{00f6}/oe/g;
    s/\x{00fc}/ue/g;
    s/\x{00ff}/yu/g;

    $_ = NFD( $_ );   ## decompose (Unicode Normalization Form D)
    s/\pM//g;         ## strip combining characters

    s/\x{00df}/ss/g;  ## German sharp s “ß” -> “ss”
    s/\x{00c6}/AE/g;  ## Æ
    s/\x{00e6}/ae/g;  ## æ
    s/\x{0132}/IJ/g;  ## Ĳ
    s/\x{0133}/ij/g;  ## ĳ
    s/\x{0152}/Oe/g;  ## Œ
    s/\x{0153}/oe/g;  ## œ

    tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
    tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
    tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊŉŋØøſ
    tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/;                   # ÞŦþŧ

    s/[^\0-\x7f]//g;  ## clear everything else
}
```
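If the list of special cases keeps growing, one possible variation is to keep the replacements in a hash and apply them in a single pass. This is a sketch of that alternative, not the code above: the table holds only a handful of the mappings, the names `%replacement` and `to_ascii` are mine, the input is assumed to be an already decoded Unicode string, and the “ä” -> “ae” style pre-step is left out.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Illustrative subset of the replacements from the code above.
my %replacement = (
    "\x{00DF}" => 'ss',   # sharp s
    "\x{00C6}" => 'AE',
    "\x{00E6}" => 'ae',
    "\x{0152}" => 'Oe',
    "\x{0153}" => 'oe',
    "\x{00D8}" => 'O',
    "\x{00F8}" => 'o',
    "\x{0141}" => 'L',
    "\x{0142}" => 'l',
);

sub to_ascii {
    my ($text) = @_;
    $text = NFD($text);    # decompose
    $text =~ s/\pM//g;     # strip combining marks
    # Replace each remaining non-ASCII character from the table, or drop it.
    $text =~ s/([^\x00-\x7f])/exists $replacement{$1} ? $replacement{$1} : ''/ge;
    return $text;
}

# "Straße, Søren, Łódź"
print to_ascii("Stra\x{00DF}e, S\x{00F8}ren, \x{0141}\x{00F3}d\x{017A}"), "\n";
# prints: Strasse, Soren, Lodz
```

The trade-off is that adding a character becomes a one-line change to the table, at the price of one hash lookup per remaining non-ASCII character.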
When dealing with European data, this gives really good results. The code is in the public domain. Don’t forget that it requires Perl 5.8. To compile the above special character translations I used that page. If something is wrong or incomplete, please let me know.
Having said all of the above, I suggest you go back to the original problem you had and think about it again. Is this the right way to go?
These days basic support for Unicode is not a miracle anymore. Operating systems, web browsers, fonts — many are already Unicode-enabled. Potentially, this could mean that transformation of data to plain ASCII is an unnecessary loss of information. Information of cultural value, by the way.
You may find my previous essay on dealing with Unicode in Perl useful.
Critique welcome.