Utf8 in web perl application (LAMP) – part 2 – Encode
Wednesday 9 December 2009 @ 6:57 am

The data we get from outside (like STDIN) is usually binary data, that is, a sequence of bytes.
Functions like uc and lc expect text strings, that is, sequences of characters.

Here is how to change binary data to text strings:

use Encode;
$foo = Encode::decode_utf8($foo);

You can also decode from other encodings:
$data = decode("iso-8859-2", $data);
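To see what decoding changes in practice, here is a minimal sketch. The sample word and all variable names are made up for illustration; the byte values are the UTF-8 and ISO-8859-2 encodings of the same four-letter word.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# Hypothetical input: the word "žluť" as raw UTF-8 bytes,
# as it might arrive from STDIN.
my $utf8_bytes = "\xC5\xBElu\xC5\xA5";
my $text       = decode_utf8($utf8_bytes);

# length() counts bytes before decoding and characters after.
printf "bytes: %d, characters: %d\n", length($utf8_bytes), length($text);

# uc() on the undecoded bytes only changes the ASCII letters;
# on the decoded text it uppercases the accented letters too.
my $uc_bytes = uc $utf8_bytes;
my $uc_text  = uc $text;
print "uc changed only ASCII letters in the byte string\n"
    if $uc_bytes eq "\xC5\xBELU\xC5\xA5";
print "uc changed every letter in the text string\n"
    if $uc_text eq "\x{17D}LU\x{164}";

# The same word as ISO-8859-2 bytes decodes to the same text string.
my $latin2_bytes = "\xBElu\xBB";
my $same_text    = decode("iso-8859-2", $latin2_bytes);
print $text eq $same_text ? "same text\n" : "different text\n";
```

The script prints only ASCII, so it does not need an output layer. To print the decoded text itself you would have to encode it again on the way out.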

To check whether a string has the UTF8 flag turned on:

use Encode qw(is_utf8);
print is_utf8($foo) ? "utf8" : "not utf8";
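A small sketch of how the flag behaves before and after decoding. The sample bytes and variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 is_utf8);

# Hypothetical input: UTF-8 bytes for the single character U+017E.
my $bytes = "\xC5\xBE";
my $text  = decode_utf8($bytes);

# The raw bytes do not carry the flag; the decoded text does.
my $bytes_state = is_utf8($bytes) ? "utf8" : "not utf8";
my $text_state  = is_utf8($text)  ? "utf8" : "not utf8";

print "bytes: $bytes_state\n";
print "text:  $text_state\n";
```

Keep in mind that the flag describes how perl stores the string internally. A text string that contains only Latin-1 characters can have the flag turned off, so treat this check as a debugging aid, not as a test for "is this text".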

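If you do not know which encoding the data is in, use the Encode::Guess module:
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);
# on failure guess_encoding returns an error message instead of an object
ref($enc) or die "Can't guess: $enc";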

You need to name the suspected encodings explicitly, because
by default it checks only ascii, utf8 and UTF-16/32 with BOM.
You can also set the default list of suspects once:

Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);

But remember that guessing is not magic, and it is likely to fail.
For example, it may fail to recognize whether the data is in iso-8859-1 or iso-8859-2.
The Encode::Guess documentation explains why:

The reason is that Encode::Guess guesses encoding by trial and error. It first splits $data into lines and tries to decode the line for each suspect. It keeps it going until all but one encoding is eliminated out of suspects list. ISO-8859 series is just too successful for most cases (because it fills almost all code points in \x00-\xff).
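The ISO-8859 problem is easy to reproduce. Below is a minimal sketch, assuming a made-up sample word stored as ISO-8859-2 bytes; the variable names are only for illustration. With two ISO-8859 suspects the guess stays ambiguous, and with a single one it succeeds.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical input: the word "žluť" as ISO-8859-2 bytes.
my $data = "\xBElu\xBB";

# Both ISO-8859 suspects can decode every byte, so no single
# candidate is left and we get an error message, not an object.
my $ambiguous = guess_encoding($data, qw/iso-8859-1 iso-8859-2/);
print "guess failed: $ambiguous\n" unless ref $ambiguous;

# With only one ISO-8859 suspect, ascii and utf8 are eliminated
# (the high bytes are invalid in both) and one candidate remains.
my $enc = guess_encoding($data, qw/iso-8859-2/);
ref $enc or die "Can't guess: $enc";

my $text = $enc->decode($data);
printf "guessed %s, %d characters\n", $enc->name, length $text;
```

So if you expect single-byte data, the practical choice is to suspect only one ISO-8859 encoding at a time, which means you already have to know which one is likely.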

 2 responses to “Utf8 in web perl application (LAMP) – part 2 – Encode”

  •   Steve Sabljak wrote:

    There’s also Encode::Detect on CPAN which uses Mozilla’s universal charset detector. It works by using a character distribution model to guess the encoding.

  •   admin wrote:

    Thanks, Steve!
