Utf8 in web perl application (LAMP) – part 2 – Encode
Wednesday 9 December 2009 @ 6:57 am

The data we get from outside (from STDIN, for example) is usually binary data, that is, a sequence of bytes.
Functions like uc and lc expect text strings, that is, sequences of characters.

Here is how to turn binary data into a text string:

use Encode;
$foo = Encode::decode_utf8($foo);
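To make the difference between bytes and text visible, here is a minimal sketch. The sample bytes and the variable names are my own choice for illustration, not something from a real application.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "\xc5\xbc" is the two-byte UTF-8 encoding of U+017C (small z with dot above).
my $bytes = "\xc5\xbc";
my $text  = decode_utf8($bytes);

# Before decoding Perl sees two bytes; after decoding, one character.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# uc leaves the raw bytes unchanged, but uppercases the decoded character.
printf "uc on bytes changed something: %s\n", uc($bytes) eq $bytes ? "no" : "yes";
printf "uc on text gives U+%04X\n", ord(uc $text);
```

The last line should report U+017B, the capital form of the same letter.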

You can also decode data that is in another encoding:
$data = decode("iso-8859-2", $data);
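As a small illustration, the sketch below decodes one ISO-8859-2 byte and then writes it back out as UTF-8 with encode, the counterpart of decode in the same module. The sample byte and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The single byte 0xB1 is the letter a with ogonek (U+0105) in ISO-8859-2.
my $latin2 = "\xb1";

# Bytes in ISO-8859-2 -> Perl text string.
my $text = decode("iso-8859-2", $latin2);
printf "code point: U+%04X\n", ord $text;

# Text string -> bytes in UTF-8, ready to be sent to the browser or a file.
my $utf8 = encode("UTF-8", $text);
printf "UTF-8 bytes (hex): %s\n", unpack("H*", $utf8);
```

The usual pattern is to decode as soon as the data comes in, work on text strings, and encode only when the data goes out.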

To check whether a string has the UTF8 flag turned on:

use Encode qw(is_utf8);
print is_utf8($foo) ? "utf8" : "not utf8";
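A quick way to see the flag change is to test the same data before and after decoding. This is only a sketch; the sample bytes and the loop are mine.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 is_utf8);

my $bytes = "\xc5\xbc";            # raw UTF-8 bytes, not decoded yet
my $text  = decode_utf8($bytes);   # decoded text string

for my $item ([bytes => $bytes], [text => $text]) {
    my ($label, $string) = @$item;
    printf "%-5s %s\n", $label, is_utf8($string) ? "utf8" : "not utf8";
}
```

Keep in mind that the flag describes how Perl stores the string internally, so it is more useful for debugging than as a test of whether the data is valid text.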

If you do not know which encoding the data is in, use the Encode::Guess module:
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);

You need to list the suspected encodings explicitly, because
by default it checks only ascii, utf8 and UTF-16/32 with BOM.
The suspects can also be set once, for all later calls:

Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);
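guess_encoding returns an encoding object when exactly one suspect fits, and a plain error string otherwise, so the result should be checked with ref before it is used. Below is a minimal sketch of that check; decode_guessed is a hypothetical helper name and the sample data is my own.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical helper: decode $data if exactly one suspect fits, otherwise die.
sub decode_guessed {
    my ($data, @suspects) = @_;
    my $enc = guess_encoding($data, @suspects);
    # On failure guess_encoding returns an error message, not an object.
    ref $enc or die "Cannot guess encoding: $enc\n";
    return ($enc->name, $enc->decode($data));
}

# UTF-8 bytes of a six-letter Polish word; the default suspects are enough here.
my $data = "za\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87";
my ($name, $text) = decode_guessed($data);
printf "guessed %s, %d characters\n", $name, length $text;
```

Extra suspects can be passed after the data, for example decode_guessed($data, qw/euc-jp shiftjis 7bit-jis/), but every added suspect also makes an ambiguous answer more likely.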

But remember that guessing is not magic, and it is likely to fail.
For example, it may fail to tell whether the data is in iso-8859-1 or iso-8859-2.
The Encode::Guess documentation explains why:

The reason is that Encode::Guess guesses encoding by trial and error. It first splits $data into lines and tries to decode the line for each suspect. It keeps it going until all but one encoding is eliminated out of suspects list. ISO-8859 series is just too successful for most cases (because it fills almost all code points in \x00-\xff).
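The ambiguity is easy to reproduce. In the sketch below the sample data is my own: a Polish word stored as ISO-8859-2 bytes. Both suspects can decode every byte, so no single encoding is left at the end.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;

# A six-letter Polish word encoded as ISO-8859-2 bytes.
my $data = "za\xbf\xf3\xb3\xe6";

my $enc = guess_encoding($data, qw/iso-8859-1 iso-8859-2/);
if (ref $enc) {
    print "guessed: ", $enc->name, "\n";
} else {
    # Both suspects survive, so we get an error message instead of an object.
    print "no single answer: $enc\n";
}

# The same byte is a different character in each encoding.
my $byte = "\xb3";
printf "0xB3 as iso-8859-1: U+%04X\n", ord(decode("iso-8859-1", $byte));
printf "0xB3 as iso-8859-2: U+%04X\n", ord(decode("iso-8859-2", $byte));
```

When only such single-byte encodings are possible, it is safer to get the encoding from somewhere else (an HTTP header, a form field, a configuration setting) than to guess it.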

See also:

  1. Utf8 in web perl application (LAMP)
  2. Utf8 in web perl application (LAMP) – binmode, charset
  3. Utf8 in web perl application (LAMP) – dbi, mysql
  4. Utf8 horror at LAMP – accept charset
  5. Unicode horror – nice tool to convert

Comments (2) - Posted in work by Lech  



 2 responses to “Utf8 in web perl application (LAMP) – part 2 – Encode”

  •   Steve Sabljak wrote:

    There’s also Encode::Detect on CPAN which uses Mozilla’s universal charset detector. It works by using a character distribution model to guess the encoding.

  •   admin wrote:

    Thanks, Steve!
