Data that comes from outside the program (STDIN, for example) is usually binary data, that is, a string of raw bytes.
Functions like uc and lc want text strings.
Here is how to decode binary data into a text string:
use Encode;
$foo = Encode::decode_utf8($foo);
You can also decode from other encodings:
$data = decode("iso-8859-2", $data);
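To see why the decoding step matters, here is a minimal sketch. The sample bytes are an assumption made for illustration (the UTF-8 encoding of "é"), standing in for whatever your program actually reads:

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# Hypothetical input: the UTF-8 bytes for U+00E9, as they might arrive on STDIN.
my $bytes = "\xC3\xA9";

# Still binary data: two bytes, and uc has nothing it knows how to upper-case.
my $bytes_len   = length $bytes;
my $bytes_upper = uc $bytes;

# After decoding it is a text string: one character, and uc works on it.
my $text     = decode_utf8($bytes);
my $text_len = length $text;
my $upper    = uc $text;

# The same character coming in as ISO-8859-2 is the single byte 0xE9.
my $latin2 = decode("iso-8859-2", "\xE9");

print "bytes: $bytes_len, characters: $text_len\n";
print "same text from both encodings\n" if $latin2 eq $text;
```

The byte string has length 2 and the decoded string has length 1, which is the difference uc and lc care about.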
To check whether a string has the UTF-8 flag turned on:
use Encode qw(is_utf8);
print is_utf8($foo) ? "utf8" : "not utf8";
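A small sketch of the check on the same data before and after decoding. The sample bytes and the labels are assumptions for illustration. Keep in mind that the flag only describes how Perl stores the string internally; it does not prove the data was decoded correctly.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 is_utf8);

my $bytes = "\xC3\xA9";            # raw UTF-8 bytes for U+00E9, not decoded yet
my $text  = decode_utf8($bytes);   # the same data as a text string

# Report the flag for each of the two strings.
for my $item ( [ bytes => $bytes ], [ text => $text ] ) {
    my ( $label, $value ) = @$item;
    printf "%s: %s\n", $label, is_utf8($value) ? "utf8" : "not utf8";
}
```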
If you do not know which encoding the data is in, use the Encode::Guess module:
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);
You need to say explicitly which encodings are suspected, because
by default it checks only ASCII, UTF-8 and UTF-16/32 with BOM.
You can also set the list of suspects once, for all later calls:
Encode::Guess->set_suspects(qw/euc-jp shiftjis 7bit-jis/);
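Here is a sketch of a complete guess, including the failure check. The sample data is an assumption: it is built with encode() so the example is self-contained, where real code would read the bytes from a file or socket. When guess_encoding cannot decide, it returns an error message string instead of an encoding object, so test the result with ref before using it.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Hypothetical sample: EUC-JP bytes for three kanji (U+65E5 U+672C U+8A9E).
my $data = encode( "euc-jp", "\x{65E5}\x{672C}\x{8A9E}" );

my $enc = guess_encoding( $data, qw/euc-jp shiftjis 7bit-jis/ );

# On failure $enc is a plain string holding the error message.
ref($enc) or die "Cannot guess: $enc";

# On success $enc is an encoding object that can decode the data.
my $text = $enc->decode($data);
printf "guessed %s, %d characters\n", $enc->name, length $text;
```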
But remember that the guessing is not magic, and it is likely to fail.
For example, it may fail to recognize whether data is in iso-8859-1 or iso-8859-2.
The Encode::Guess documentation explains why:
The reason is that Encode::Guess guesses encoding by trial and error. It first splits $data into lines and tries to decode the line for each suspect. It keeps it going until all but one encoding is eliminated out of suspects list. ISO-8859 series is just too successful for most cases (because it fills almost all code points in \x00-\xff).
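The failure is easy to reproduce. In this sketch the sample string is an assumption: a word ending in the byte 0xE9, which is a valid character in both ISO-8859-1 and ISO-8859-2, so neither suspect gets eliminated.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: the byte 0xE9 decodes fine in both suspect encodings.
my $data = "caf\xE9";

my $enc = guess_encoding( $data, qw/iso-8859-1 iso-8859-2/ );

if ( ref $enc ) {
    print "guessed: ", $enc->name, "\n";
}
else {
    # No single winner is left, so $enc holds an error message instead.
    print "guess failed: $enc\n";
}
```

If you already know the data is one of the ISO-8859 encodings, it is safer to call decode with that name directly than to guess.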
See also:
- Utf8 in web perl application (LAMP)
- Utf8 in web perl application (LAMP) – binmode, charset
- Utf8 in web perl application (LAMP) – dbi, mysql
- Utf8 horror at LAMP – accept charset
- Unicode horror – nice tool to convert


There’s also Encode::Detect on CPAN which uses Mozilla’s universal charset detector. It works by using a character distribution model to guess the encoding.
Thanks, Steve!