Utf8 in web perl application (LAMP) – part 2 – Encode

Wednesday 9 December 2009 @ 6:57 am

The data we get from outside (like STDIN) is usually binary data.
Command like uc, lc wants text strings.

Here is how to change binary data to text strings:

use Encode ;
$foo = Encode::decode_utf8($foo);

you can also convert from other encodings:
$data = decode(“iso-8859-2”, $data);

To check whether a string have utf flag turned on:

use Encode qw(is_utf8);
print is_utf8($foo) ? “utf8” : “not utf8”

If you do not know what encoding is the data, use perl module:
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);

You need to tell explicitly which encodings are suspected, because
by default, it checks only ascii, utf8 and UTF-16/32 with BOM.

Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);

But remember, that the guessing is not magic, and it likely to fail.
For example it may fail to recognize whether data are in is-8859-1 or iso-8859-2
Because:

The reason is that Encode::Guess guesses encoding by trial and error. It first splits $data into lines and tries to decode the line for each suspect. It keeps it going until all but one encoding is eliminated out of suspects list. ISO-8859 series is just too successful for most cases (because it fills almost all code points in \x00-\xff).

Tags: ,

Comments (2) - Posted in work by  




Warning: Creating default object from empty value in /home3/lech/public_html/baczynski.com/perl/wp-includes/comment-template.php on line 1056

 2 responses to “Utf8 in web perl application (LAMP) – part 2 – Encode”

Leave a comment