Unicode horror – nice tool to convert
Wednesday 13 January 2010 @ 6:59 am

Here is a nice tool by Richard Ishida that converts characters between different notations, for example percent encoding for URIs, 0x notation, decimal code points etc.:

http://rishida.net/tools/conversion/

OK, it is not strictly related to Perl, but I guess it may be handy for people who were interested in the “utf at LAMP horror” cycle.
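For the curious, the same kind of conversion can be sketched in a few lines of Perl. This is only an illustration, not how the tool above works; the helper names percent_encode, hex_notation and decimal_notation are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode every UTF-8 byte of a character, as used in URIs.
# (Hypothetical helper name, for illustration only.)
sub percent_encode {
    my ($char) = @_;
    my @bytes = unpack 'C*', encode('UTF-8', $char);
    return join '', map { sprintf '%%%02X', $_ } @bytes;
}

# 0x notation and decimal notation of the code point.
sub hex_notation     { my ($char) = @_; return sprintf '0x%04X', ord $char }
sub decimal_notation { my ($char) = @_; return ord $char }

my $char = "\x{017C}";    # LATIN SMALL LETTER Z WITH DOT ABOVE
print percent_encode($char),   "\n";    # %C5%BC
print hex_notation($char),     "\n";    # 0x017C
print decimal_notation($char), "\n";    # 380
```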

Comments (0) - Posted in work by  



Utf8 horror at LAMP – accept charset
Wednesday 30 December 2009 @ 6:55 am

Continuing the never ending saga of perl / utf horror:

<form method="post" accept-charset="utf-8" action="…">

Well, I never used it… and my web app works.

Is this accept-charset really needed? Do you know?

Comments (2) - Posted in work by  



Utf8 in web perl application (LAMP) – dbi, mysql
Wednesday 23 December 2009 @ 6:59 am

Horror with utf8 and LAMP (Perl) web application continued:

We need to take care of the MySQL connection, so that it is utf8-ready:

if (my $dbh = DBI->connect("DBI:mysql:database=" . $name . ';host=' . $hostname, $user, $password,
    {
        RaiseError        => $raise,
        # AutoCommit      => 1,
        mysql_enable_utf8 => 1,    # tell DBD::mysql to treat strings as UTF-8
        on_connect_do     => [ "SET NAMES 'utf8'", "SET CHARACTER SET 'utf8'" ],
    })) {

    $dbh->{'mysql_enable_utf8'} = 1;
    return $dbh; # DBI database handle
}

We also must make sure that our tables are unicode:

CREATE TABLE `foo` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
….

)  DEFAULT CHARSET=utf8

And remember that utf8 characters may take more than one byte each, so be prepared for columns that need more storage and for problems with indexes, whose size limit is counted in bytes. A typical error looks like this: Specified key was too long; max key length is 1000 bytes (see http://bugs.mysql.com/bug.php?id=4541).
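A rough sketch of the arithmetic behind that error, assuming up to 3 bytes per character (the maximum for MySQL's utf8 character set) and the 1000-byte limit quoted in the message. The helper names utf8_byte_length and max_key_bytes are invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes a character string occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    return length(encode('UTF-8', $text));
}

# Worst-case index key size for a VARCHAR of the given length,
# ASSUMING at most 3 bytes per character (MySQL "utf8").
sub max_key_bytes {
    my ($varchar_length) = @_;
    return 3 * $varchar_length;
}

my $text = "za\x{017C}\x{00F3}\x{0142}\x{0107}";    # "zażółć"
printf "%d characters, %d bytes\n", length($text), utf8_byte_length($text);
printf "VARCHAR(255) key: up to %d bytes\n", max_key_bytes(255);
printf "VARCHAR(334) key: up to %d bytes\n", max_key_bytes(334);
```

Under these assumptions a single indexed VARCHAR(334) column, or two VARCHAR(255) columns in one index, already exceeds 1000 bytes.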

Comments (0) - Posted in work by  



Utf8 in web perl application (LAMP) – binmode, charset
Wednesday 16 December 2009 @ 6:57 am

After wrestling with Perl encoding, we need to make sure that the pages of the website we create are displayed in utf8.
This means we need a proper header in the pages, for example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">



</head>
<body>

and/or:

Content-Type: text/html; charset=utf-8

The second step is to set binmode on STDOUT (if we print our dynamically generated web pages):

binmode STDOUT, ":utf8";

to get rid of

Wide character in print at …

warnings.
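To see what the output layer actually does, here is a minimal sketch that prints to an in-memory filehandle instead of STDOUT, so the resulting bytes can be inspected. It uses the :encoding(UTF-8) layer, a stricter relative of :utf8; the helper name print_as_utf8 is made up for this example.

```perl
use strict;
use warnings;

# Write text through a UTF-8 output layer and return the raw bytes produced.
# (Hypothetical helper; a real CGI script would call binmode on STDOUT.)
sub print_as_utf8 {
    my ($text) = @_;
    my $buffer = '';
    open(my $out, '>', \$buffer) or die "Cannot open in-memory handle: $!";
    binmode($out, ':encoding(UTF-8)');    # same idea as binmode STDOUT, ":utf8"
    print {$out} $text;
    close($out);
    return $buffer;
}

my $bytes = print_as_utf8("za\x{017C}\x{00F3}\x{0142}\x{0107}\n");
printf "%d bytes written\n", length($bytes);
```

Without the binmode call, the same print would trigger the "Wide character in print" warning.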

Comments (2) - Posted in work by  



Utf8 in web perl application (LAMP) – part 2 – Encode
Wednesday 9 December 2009 @ 6:57 am

The data we get from outside (like STDIN) is usually binary data, that is, a sequence of bytes.
Functions like uc and lc expect text strings, that is, sequences of characters.

Here is how to change binary data to text strings:

use Encode;
$foo = Encode::decode_utf8($foo);

You can also convert from other encodings:
$data = decode("iso-8859-2", $data);

To check whether a string has the utf8 flag turned on:

use Encode qw(is_utf8);
print is_utf8($foo) ? "utf8" : "not utf8";
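A small sketch of why the decoding step matters, using the Polish letter "ż" as sample input. The helper name upper_utf8_bytes is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Decode UTF-8 bytes to text, uppercase the text, encode back to bytes.
# (Hypothetical helper name, for illustration only.)
sub upper_utf8_bytes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    return encode('UTF-8', uc $text);
}

my $bytes = "\xC5\xBC";                  # "ż" as it arrives from outside: 2 bytes
my $text  = decode('UTF-8', $bytes);     # "ż" as text: 1 character, U+017C

printf "bytes: length %d, text: length %d\n", length($bytes), length($text);
printf "uc changes raw bytes: %s\n", (uc($bytes) eq $bytes ? "no" : "yes");
printf "uc changes text: %s\n",      (uc($text)  eq $text  ? "no" : "yes");
printf "utf8 flag on text: %s\n",    (is_utf8($text) ? "on" : "off");

# The same letter stored as ISO-8859-2 is the single byte 0xBF.
my $from_latin2 = decode('iso-8859-2', "\xBF");
printf "same character from iso-8859-2: %s\n", ($from_latin2 eq $text ? "yes" : "no");
```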

If you do not know which encoding the data is in, use the Perl module Encode::Guess:
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);

You need to tell explicitly which encodings are suspected, because
by default, it checks only ascii, utf8 and UTF-16/32 with BOM.

Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);

But remember that the guessing is not magic, and it is likely to fail.
For example, it may fail to recognize whether the data is in iso-8859-1 or iso-8859-2.
As the Encode::Guess documentation puts it:

The reason is that Encode::Guess guesses encoding by trial and error. It first splits $data into lines and tries to decode the line for each suspect. It keeps it going until all but one encoding is eliminated out of suspects list. ISO-8859 series is just too successful for most cases (because it fills almost all code points in \x00-\xff).
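Here is a minimal sketch of both outcomes. It relies on guess_encoding returning an encoding object on success and a plain error string otherwise; the wrapper guess_name is made up for this example.

```perl
use strict;
use warnings;
use Encode::Guess;

# Return the name of the guessed encoding, or undef when the guess
# fails or is ambiguous. (Hypothetical wrapper, for illustration only.)
sub guess_name {
    my ($data, @suspects) = @_;
    my $enc = guess_encoding($data, @suspects);
    return ref($enc) ? $enc->name : undef;    # a non-reference is an error message
}

# Valid UTF-8 bytes with non-ASCII content: the default suspects are enough.
my $utf8_guess = guess_name("za\xC5\xBC");
print "UTF-8 sample: ", (defined $utf8_guess ? $utf8_guess : "no unique guess"), "\n";

# A single high byte is valid in both ISO-8859-1 and ISO-8859-2,
# so the guess cannot be narrowed down to one encoding.
my $latin_guess = guess_name("\xB1", qw/iso-8859-1 iso-8859-2/);
print "Latin sample: ", (defined $latin_guess ? $latin_guess : "no unique guess"), "\n";
```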

Comments (2) - Posted in work by  



Utf8 in web perl application (LAMP)
Wednesday 2 December 2009 @ 6:45 am

Making a correct utf8 web application in LAMP (Perl) is not easy. There are lots of dangerous traps along the way.

We need to take care of the source code encoding, the encoding of web form data entered by the user, the MySQL encoding, the encoding of the data we display (write), and perhaps also the encoding of data read from disk files.

First, we need to take care of the source code. If we want to write something like this in it:

$var = "zażółć gęślą jaźń";

we need to use utf8.

The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

use utf8 is not a magical trick to fix all problems with utf8, just a beginning of the journey…

We must have a good code editor or IDE that understands utf8. It is also nice to be able to open files in other encodings and convert them to utf8. For unix/linux there is for example KDevelop, and many other tools. For windows, there are many editors too. See: http://en.wikipedia.org/wiki/Comparison_of_text_editors#Unicode_and_other_character_encodings and http://www.alanwood.net/unicode/utilities_editors.html

When using use utf8 remember to save code in utf8 encoding!
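A minimal sketch of what the pragma changes, assuming this snippet itself is saved as UTF-8. The same literal is measured once with use utf8 in effect and once with it switched off in an inner block.

```perl
use strict;
use warnings;
use utf8;    # ASSUMPTION: this source file is saved in UTF-8

# With the pragma on, the literal is a string of characters.
my $chars = length("zażółć");

# With the pragma off, the very same literal is a string of raw bytes.
my $bytes = do {
    no utf8;
    length("zażółć");
};

print "$chars characters with use utf8, $bytes bytes without\n";
```

The four non-ASCII letters take two bytes each in UTF-8, which is where the difference between the two numbers comes from.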

To be continued…

Comments (2) - Posted in work by