Perl version 5.6 introduced partial Unicode support. Perl 5.8 has much improved support for it, but still may be very cumbersome to use. I propose solutions to some problems, mostly for Perl 5.8 users.
In Perl 5.8, Unicode support is described in the perluniintro, perlunicode, Encode and utf8 manpages, and mentioned in the encoding and -f open manpages (read them with the perldoc tool, for instance).
The major problem with this documentation is its volume. A normal programmer won’t read it. (Cool programmers don’t read documentation, as we all know.) Most programmers don’t even need to read it all, because to work with Unicode you just need to know the basic facts and rules.
I somehow got into several different kinds of trouble with Unicode in Perl, both in 5.6 and 5.8, in several different projects. It was always about processing and generating data in the UTF8 encoding.
The two main problems I’ve seen are:
The "Wide character in print" warning
Having said the above, reading or at least browsing through the above-mentioned manpages is still a good way to understand and solve your Unicode problems. If you don’t have time for that now, read on.
There is a distinction between bytes and characters.
There is a "utf8" flag on every scalar value, which might be "on" or "off".
(Attention! Here is the source of many many perl-unicode problems:) If you take a string with utf8 flag off and concatenate it with another string with utf8 flag on, perl will convert the first one to utf8.
This may sound OK and obvious. But then you think: how? Perl needs to know which encoding the string is in before converting it, and it has to guess.
The way perl guesses is documented (by default it treats the bytes as Latin-1, whatever they really are), but my suggestion is: never let perl do that, unless you have no choice. In my experience, this is the reason for double-encoded utf8 strings, and it was the main source of headache.
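The effect can be reproduced in a few lines. This is only a sketch: the variable names are made up for illustration, and it assumes a perl recent enough to have utf8::is_utf8 (5.8.1 or later). It joins raw UTF8 bytes with a character string, then compares the result with what you get when the bytes are decoded first.

```perl
use strict;
use warnings;
use Encode ();

# Raw UTF8 bytes for "é" (U+00E9), as you would get them from an
# undecoded file or CGI parameter. The utf8 flag is off.
my $bytes = "\xC3\xA9";

# A character string holding U+263A. The utf8 flag is on.
my $chars = "\x{263A}";

# Concatenation upgrades $bytes as if it were Latin-1:
# the two bytes become the two characters U+00C3 and U+00A9.
my $mixed  = $bytes . $chars;
my $double = Encode::encode_utf8($mixed);    # double-encoded output

# Decoding first turns the two bytes into the single character U+00E9.
my $clean  = Encode::decode_utf8($bytes) . $chars;
my $single = Encode::encode_utf8($clean);

printf "flag on \$bytes: %s, flag on \$chars: %s\n",
    ( utf8::is_utf8($bytes) ? 'on' : 'off' ),
    ( utf8::is_utf8($chars) ? 'on' : 'off' );
printf "mixed: %d characters, %d bytes when encoded\n",
    length($mixed), length($double);
printf "clean: %d characters, %d bytes when encoded\n",
    length($clean), length($single);
```

The mixed string comes out one character and two bytes longer than the clean one: that extra material is the double encoding.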
If you process Unicode data in your scripts, always use utf8 to store and process your data and make sure perl knows you use it. (i.e. make sure all your unicode-strings have "utf8" flag on.)
The Encode manpage will tell you about the _utf8_on() and _utf8_off() functions, but don’t rush to use them yet. There are better ways to do that.
First of all, it is often already there, so you don’t need to worry. For instance, if you read your data in through an XML parser, you may assume that strings coming from it will be in UTF8 and will have utf8 flag on (unless you do something weird, like trying to get it in original form from the parser, which you shouldn’t anyway).
If you read data from a file, there are several different ways to tell perl about its encoding.
In perl 5.6 there is a magic pack 'U*', unpack( 'U*', …) construct, which can help. So if you open a file which contains utf8 data, you read it into a variable and then you say:
$data = pack 'U*', unpack( 'U*', $data );
And your $data now has the "utf8" flag on.
In perl 5.8, you should use Encode::decode_utf8 in a similar way, but the same pack & unpack trick will probably do the job; I leave it for you to try. I use:
use Encode; $data = Encode::decode_utf8( $data );
(Do not forget, though, that not every sequence of bytes is valid UTF8. So this operation may fail. See the Encode manpage for error handling.)
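One way to handle such failures is sketched below. The helper name decode_or_undef is hypothetical, and the sketch uses Encode::decode with the strict 'UTF-8' encoding name rather than decode_utf8 itself. It asks Encode to die on malformed input (FB_CROAK), to leave the source string alone (LEAVE_SRC), and it catches the error with eval.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string,
# or undef if the bytes are not valid UTF8.
sub decode_or_undef {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    return $text;    # undef when decode() died inside the eval
}

my $good = decode_or_undef("\xC3\xA9");    # valid: one character, U+00E9
my $bad  = decode_or_undef("\xC3\x28");    # invalid: 0x28 cannot follow 0xC3

print defined $good ? "good input decoded\n" : "good input rejected\n";
print defined $bad  ? "bad input decoded\n"  : "bad input rejected\n";
```

Whether to reject bad input, as here, or to substitute a replacement character depends on where the data comes from; the Encode manpage lists the other CHECK values.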
Let’s look at a real-life example. In ACIS we take HTML form input parameters through CGI in UTF8 encoding. We generate HTML in UTF8. To manipulate the user’s input, we need to tell perl that it is in UTF8:
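use Encode;

# $query is the CGI object holding the submitted form
my @par_names  = $query->param;
my $form_input = {};
foreach my $name ( @par_names ) {
    my @val = $query->param( $name );
    # turn the UTF8 bytes of every value into characters
    foreach ( @val ) {
        $_ = Encode::decode_utf8( $_ );
    }
    # store under the parameter name: a scalar for one value,
    # an array reference for several
    if ( scalar( @val ) == 1 ) {
        $form_input->{$name} = $val[0];
    } else {
        $form_input->{$name} = \@val;
    }
}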
An important thing is that the result of Encode::decode_utf8 doesn’t always have the utf8 flag "on", and that is OK. If you decode_utf8 a pure-ASCII string, it won’t have the utf8 flag on. ASCII data is safe with regard to UTF-8 conversions: it doesn’t need any, so it is impossible to screw up.
The "Wide character in print" warning appears when you output a Unicode string (that is, a string with the utf8 flag "on" and containing at least one character outside the ASCII range) to a non-Unicode filehandle.
"What the f*ck 'Non-unicode filehandle?'" you could ask.
Perl 5.8 introduces PerlIO, a new Input/Output subsystem, which has the notion of a filehandle layer (formerly called a discipline). With a filehandle layer you can do on-the-fly transparent encoding conversions, or line-ending conversion.
Say, if you open a file as:
open FILE, "<:encoding(iso-8859-7)", $filename;
its content will be assumed to be in the iso-8859-7 encoding. Perl will use that to interpret the file’s data correctly (i.e. to convert it to internal UTF8).
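Here is a small round trip that shows the input layer at work. It is a sketch, not part of the original text: it uses a temporary file from File::Temp, writes three raw iso-8859-7 bytes into it, and reads them back through the :encoding(iso-8859-7) layer. The code points are printed as numbers, so the script itself does not print any wide characters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three raw bytes: alpha, beta, gamma in iso-8859-7.
my ( $out, $filename ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "\xE1\xE2\xE3\n";
close $out or die "Cannot close $filename: $!";

# Read them back through the encoding layer: bytes become characters.
open my $in, '<:encoding(iso-8859-7)', $filename
    or die "Cannot open $filename: $!";
my $line = <$in>;
close $in;
chomp $line;

my @code_points = map { sprintf 'U+%04X', ord } split //, $line;
print "read ", length($line), " characters: @code_points\n";
```

The three bytes arrive as the three Greek letters U+03B1, U+03B2 and U+03B3, with no explicit decode call anywhere in the script.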
Basically, to get rid of the warning, you have two ways: one is wrong and the other is right. The wrong way is to turn off the utf8 flag on your data. Then the characters will turn into bytes, and it will print out smoothly.
The right way is to tell perl that your output is expected to be in UTF8. So, if you print to a file, open the file this way:
open FILE, ">:utf8", $filename;
If you print to standard output (or standard error), you can do this:
binmode( STDOUT, ":utf8" );
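To see the difference for yourself, the following sketch prints the same string to a temporary file twice, once through a plain filehandle and once through a :utf8 one, and counts the warnings perl raises each time. The temporary file and the warning counter are scaffolding for the demonstration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A string with a character above 0xFF, so the utf8 flag is on.
my $text = "caf\x{E9} \x{263A}\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my ( $tmp, $filename ) = tempfile( UNLINK => 1 );
close $tmp;

# Plain filehandle: perl complains about the wide character.
open my $plain, '>', $filename or die "Cannot open $filename: $!";
print {$plain} $text;
close $plain;
my @without_layer = @warnings;

# Filehandle with the :utf8 layer: no complaint.
@warnings = ();
open my $utf8, '>:utf8', $filename or die "Cannot open $filename: $!";
print {$utf8} $text;
close $utf8;
my @with_layer = @warnings;

printf "warnings without layer: %d, with layer: %d\n",
    scalar @without_layer, scalar @with_layer;
```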
Perl’s "There’s more than one way to do it" applies to Unicode support as much as to everything else. So if you do take a good look at the documentation, you’ll see that there are other ways, functions and tricks to fix (or break) your Unicode-aware script.