Unicode-processing issues in Perl and how to cope with them

Perl version 5.6 introduced partial Unicode support. Perl 5.8 improved it considerably, but it can still be very cumbersome to use. I propose solutions to some problems, mostly for Perl 5.8 users.

In Perl 5.8, Unicode support is described in the perluniintro, perlunicode, Encode and utf8 manpages, and mentioned in the encoding and open manpages (read them with the perldoc tool, for instance: perldoc -f open).

The major problem with this documentation is its volume. A normal programmer won’t read it. (Cool programmers don’t read documentation, as we all know.) Most programmers don’t even need to read it all, because to work with Unicode you just need to know the basic facts and rules.

I somehow got into several different kinds of trouble with Unicode in Perl, both in 5.6 and 5.8, in several different projects. It was always about processing and generating data in UTF8 encoding.

The two main problems I’ve seen are:

UTF8 data getting double-encoded
Wide character in print warning

That said, reading or at least browsing through the manpages mentioned above is still a good way to understand and solve your Unicode problems. If you don’t have time for that now, read on.

The basic facts you need to know

There is a distinction between bytes and characters.

There is a "utf8" flag on every scalar value, which might be "on" or "off".
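
To make these two facts concrete, here is a small sketch (mine, not taken from the manpages) that looks at the same data once as bytes and once as characters. It assumes Perl 5.8.1 or later, where utf8::is_utf8() is available to inspect the flag; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Three octets: the UTF-8 encoding of U+20AC (euro sign). The utf8 flag is off.
my $octets = "\xE2\x82\xAC";

# Decoding turns the three octets into one character and sets the flag.
my $text = Encode::decode_utf8($octets);

printf "octets: length %d, flag %s\n",
    length($octets), utf8::is_utf8($octets) ? "on" : "off";
printf "text:   length %d, flag %s\n",
    length($text), utf8::is_utf8($text) ? "on" : "off";
```

The byte string reports a length of 3 with the flag off, the decoded string a length of 1 with the flag on.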

(Attention! Here is the source of many many perl-unicode problems:) If you take a string with utf8 flag off and concatenate it with another string with utf8 flag on, perl will convert the first one to utf8.

This may sound ok and obvious. But then you think: how? Perl needs to know which encoding the string is in before converting it, and perl will try to guess it.

The algorithm perl uses in guessing is documented (by default it treats every byte as an ISO-8859-1 character, and it may also take your locale into account), but my suggestion is: never let perl do that, unless you have no choice. In my experience, this is the reason for double-encoded utf8 strings, and it was my main source of headaches.
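
Here is a minimal sketch of how the double encoding happens. It assumes perl's default behaviour (no locale or encoding pragma in effect), where a byte string that gets upgraded is read as ISO-8859-1. The variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode ();

my $raw   = "\xC3\xA9";                       # UTF-8 octets for U+00E9, never decoded
my $clean = Encode::decode_utf8("\xC3\xA9");  # the same data, properly decoded

# Concatenation upgrades $raw: each of its two octets becomes its own character.
my $joined = $raw . $clean;

# Encoding for output now encodes those two stray characters once more.
my $output = Encode::encode_utf8($joined);

printf "%d characters, output octets: %s\n",
    length($joined),
    join ' ', map { sprintf '%02X', ord($_) } split //, $output;
```

The joined string has 3 characters instead of 2, and the output is C3 83 C2 A9 C3 A9: the first e-acute has been encoded twice, the second one is fine.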

If you process Unicode data in your scripts, always use utf8 to store and process your data and make sure perl knows you use it. (i.e. make sure all your unicode-strings have "utf8" flag on.)

The Encode manpage will tell you about the _utf8_on() and _utf8_off() functions, but don’t rush to use them yet.

There are better ways to do that.

How to get utf8 flag "on" on your scalars?

First of all, it is often already there, so you don’t need to worry. For instance, if you read your data in through an XML parser, you may assume that strings coming from it will be in UTF8 and will have utf8 flag on (unless you do something weird, like trying to get it in original form from the parser, which you shouldn’t anyway).

If you read data from a file, there are several different ways to tell perl about its encoding.

In perl 5.6 there is a magic pack 'U*', unpack( 'U*', …) construct, which can help. If you open a file that contains utf8 data and read it into a variable, you can then say:


  $data = pack 'U*', unpack( 'U*', $data );

And your $data now has "utf8" flag on.

In perl 5.8, you should use Encode::decode_utf8 in a similar way. The same pack & unpack trick will probably do the job too, but I leave that for you to try. I use:


  use Encode;
  $data = Encode::decode_utf8( $data );

(Do not forget though, that not every sequence of bytes is valid UTF8. So this operation may fail. See Encode manpage for error-handling.)
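
One possible way to handle such failures, as a sketch: ask Encode to die on malformed input and catch that with eval. The helper name decode_or_undef is hypothetical. Encode::FB_CROAK and Encode::LEAVE_SRC are constants from the Encode module, and I use the strict "UTF-8" encoding name here rather than decode_utf8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if the
# octets are not valid UTF-8.
sub decode_or_undef {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed data;
    # LEAVE_SRC keeps it from modifying $octets in place.
    return eval {
        Encode::decode( 'UTF-8', $octets,
            Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
}

my $good = decode_or_undef("\xC3\xA9");   # valid: one character, U+00E9
my $bad  = decode_or_undef("\xFF\xFE");   # 0xFF can never appear in UTF-8

print "good input: ", ( defined $good ? "decoded" : "rejected" ), "\n";
print "bad input:  ", ( defined $bad  ? "decoded" : "rejected" ), "\n";
```

Whether to reject bad input, replace the broken parts or log them is up to your application; the Encode manpage lists the other CHECK values.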

Example

Let’s look at a real-life example. In ACIS we take HTML form input parameters through CGI in UTF8 encoding. We generate HTML in UTF8. To manipulate the user’s input, we need to tell perl that it is in UTF8:


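  use Encode;

  # $query is assumed to be the CGI query object for the current request
  my @par_names = $query -> param;

  my $form_input = {};

  foreach my $name ( @par_names ) {
    my @val = $query -> param( $name );

    # decode every value from UTF8 bytes into characters
    foreach ( @val ) {
      $_ = Encode::decode_utf8( $_ );
    }

    # single values are stored as scalars, multiple values as array refs
    if( scalar( @val ) == 1 ) {
      $form_input ->{$name} = $val[0];
    } else {
      $form_input ->{$name} = \@val;
    }
  }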

One important detail: the result of Encode::decode_utf8 doesn’t always have the utf8 flag "on", and that is OK. If you decode_utf8 a pure-ASCII string, it may not have the utf8 flag on (this depends on your version of Encode). ASCII data is safe with regard to UTF-8 conversions: it doesn’t need any, so it is impossible to screw up.

Wide character in print warning

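The warning happens when you output a Unicode string (that is, a string with the utf8 flag "on" and containing at least one character with a code point above 255) on a non-unicode filehandle. Characters in the 128-255 range don’t trigger the warning; perl quietly writes them as single ISO-8859-1 bytes, which is usually not what you want either.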
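
If you want to see the warning without cluttering your terminal, here is a small sketch that catches it with a __WARN__ handler. It prints to a temporary file made with File::Temp and assumes nothing (such as the -C switch or the PERL_UNICODE environment variable) has put a UTF-8 layer on new filehandles.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A plain filehandle: no encoding layer on it.
my ( $fh, $path ) = tempfile( UNLINK => 1 );

# U+263A is above 255, so it cannot be written as a single byte.
print {$fh} "smiley: \x{263A}\n";
close $fh;

print "caught: $_" for @warnings;
```

The caught message starts with "Wide character in print".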

"What the f*ck 'Non-unicode filehandle?'" you could ask.

Perl 5.8 introduces PerlIO, a new Input/Output subsystem, which has the notion of a filehandle layer (called a "discipline" in older documentation). With a filehandle layer you can do on-the-fly transparent encoding conversions, or line-ending conversion.

Say, if you open a file as:


 open FILE, "<:encoding(iso-8859-7)", $filename or die "Cannot open $filename: $!";

its content will be assumed to be in iso-8859-7 encoding. Perl will use that to interpret the file’s data correctly (i.e. to convert it to internal UTF8).
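
A round-trip sketch of an input layer at work: it writes three raw ISO-8859-7 bytes to a temporary file and reads them back through the layer. The temporary file and variable names are only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw bytes: 0xE1 0xE2 0xE3 are alpha, beta, gamma in ISO-8859-7.
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "\xE1\xE2\xE3\n";
close $out;

# Read them back through the encoding layer: bytes come in, characters come out.
open my $in, "<:encoding(iso-8859-7)", $path
    or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "U+%04X\n", ord($_) for split //, $line;
```

This prints U+03B1, U+03B2 and U+03B3: each single byte in the file has become the corresponding Greek character.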

Basically, to get rid of the warning, you have two ways: one is wrong and the other is right. The wrong way is to turn off the utf8 flag on your data. Then the characters will turn into bytes, and it will print out smoothly.

The right way is to tell perl that your output is expected to be in UTF8. So, if you print to a file, open the file this way:


 open FILE, ">:utf8", $filename or die "Cannot open $filename: $!";

If you print to standard output (or standard error), you can do this:


 binmode( STDOUT, ":utf8" );
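
To check that the layer really does the job, here is a sketch that prints a wide character through a ":utf8" handle, collects any warnings, and then reads the file back as raw bytes. It uses a File::Temp temporary file instead of STDOUT so the result can be inspected; binmode on STDOUT installs the same layer on a handle that is already open.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect warnings so we can confirm that none are raised.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my ( $tmp, $path ) = tempfile( UNLINK => 1 );
close $tmp;

# Output handle with the :utf8 layer: characters go in, UTF-8 octets come out.
open my $out, ">:utf8", $path or die "Cannot open $path: $!";
print {$out} "\x{263A}";
close $out;

# Read the file back without any layer to see what was actually written.
open my $raw, "<:raw", $path or die "Cannot open $path: $!";
my $octets = do { local $/; <$raw> };
close $raw;

printf "%d warnings, octets: %s\n",
    scalar @warnings,
    join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
```

The file contains E2 98 BA, the UTF-8 encoding of U+263A, and no warning is raised.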

Perl’s "There’s more than one way to do it" applies to Unicode support as much as to everything else. So if you do take a good look at the documentation, you’ll see that there are other ways, functions and tricks to fix (or break) your Unicode-aware script.