If you seriously think that Perl5 is less suitable for text processing than Perl...

ptx · on Jan 18, 2015

From a quick skim of the docs[0], Perl 5.8's Unicode support sounds a lot like Python 2's, except that the default encoding is latin1 instead of ASCII. That is, unlike the Python 3 way of explicitly decoding binary data into Unicode text at the point where it's read (where presumably the encoding of the data is known), it defers the decoding until the data is used (at some completely unrelated point in the program) and decodes it with some assumed, globally agreed-upon encoding. Since there is no globally agreed-upon encoding (Windows even has different legacy encodings between Win32 and the command prompt!), this will appear to work as long as the data is ASCII. Later, when you least expect it (and someone inputs a non-ASCII string), it gives a UnicodeDecodeError in Python 2 or garbage data in Perl 5.8.
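To make the difference concrete, here is a minimal Python 3 sketch of "decode at the boundary" versus "decode later with an assumed default". The sample string, the choice of UTF-8 as the real encoding, and the helper names `assume_latin1` and `assume_ascii` are all made up for illustration; the two helpers only imitate the implicit defaults described above, they are not how either interpreter is implemented.

```python
# Bytes as they might arrive from a file or socket. That they are UTF-8 is an
# assumption made for this example.
raw = "naïve café".encode("utf-8")

# Python 3 style: decode once, at the point of input, where the encoding is known.
text = raw.decode("utf-8")


def assume_latin1(data: bytes) -> str:
    """Hypothetical helper: decode with an assumed Latin-1 default.

    Latin-1 maps every byte to a character, so this never raises; it just
    produces the wrong characters when the data was really UTF-8.
    """
    return data.decode("latin-1")


def assume_ascii(data: bytes) -> str:
    """Hypothetical helper: decode with an assumed ASCII default.

    This raises UnicodeDecodeError on the first non-ASCII byte.
    """
    return data.decode("ascii")


# Deferred decoding with the wrong assumption: silent garbage.
garbled = assume_latin1(raw)

# Deferred decoding with the ASCII assumption: a late error.
try:
    assume_ascii(raw)
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Pure-ASCII input hides the problem: every assumption gives the same answer.
plain = b"hello"
same_for_ascii = assume_latin1(plain) == assume_ascii(plain) == plain.decode("utf-8")
```

With ASCII-only input all three paths agree, which is why the bug tends to show up only once someone enters a non-ASCII string.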

Ned Batchelder gave an excellent talk[1] that explains how the Python 3 approach to Unicode works. I think it makes a lot more sense once you understand it; the Python 2 way was clearly broken, and it looks like Perl 5 has the same problem but hides it better.

[0] http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod

[1] http://m.youtube.com/watch?v=sgHbC6udIqc

b2gills · on Jan 19, 2015

Why are you pointing to the Unicode intro doc for an ancient version of Perl?

Also, according to perlunifaq, the minimum version you should be using is 5.8.1, which the documents for 5.8.0 would of course not mention.

Really, if you want good Unicode handling you should probably use 5.16.0 or later. If you want the latest version of Unicode, there are ways of changing which version of Unicode Perl is compiled with, but it is easier to just use the latest version of Perl, which is 5.20.

http://perldoc.perl.org/perluniintro.html
http://perldoc.perl.org/perlunitut.html
http://perldoc.perl.org/perlunifaq.html
http://perldoc.perl.org/perlunicode.html

P.S. I noticed in the Python talk you linked that no one knew that the pile of poo symbol is in there because the Japanese characters for luck and poo are very similar. (I am unable to find a link to where I first read this.) The Japanese are also responsible for why we call them emoji (e means image, and moji means character). http://www.fastcompany.com/3037803/the-oral-history-of-the-p...

duskwuff · on Jan 19, 2015

The documentation you're looking at is quite old; I'd recommend looking at the current version for a better view. (The implementation is largely the same; the documentation has just improved quite a bit since then.)

http://perldoc.perl.org/perluniintro.html

Anyways, Perl5's Unicode support is quite different from Python's.

Specifically, it doesn't have distinct "Unicode string" and "byte string" types; instead, it has a single unified string type. These strings may be internally stored as Latin1 or UTF-8, depending on how they were created, but they behave identically for almost[1] all purposes, and there are easy ways to force Perl to convert between the formats. It's still possible to create a nonsensical string if you do something silly like append Unicode characters to a string containing raw UTF-8 data, but that's not something the language can entirely protect you from.

[1]: The only exceptions I'm aware of are functions which explicitly operate on the utf8 status, like utf8::is_utf8(), and bitwise operations like &|^ and vec().
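The "nonsensical string" case can be sketched in Python 3 as an analogy (this is not Perl code). It assumes that, in a single-string-type model, undecoded bytes end up being treated as Latin-1 code points when combined with character data; the variable names and sample characters are invented for the example.

```python
# Raw UTF-8 bytes that were never decoded (hypothetical input).
raw_utf8 = "é".encode("utf-8")  # the two bytes 0xC3 0xA9

# Proper character data.
suffix = "→"

# Python 3 keeps bytes and text as distinct types and refuses to mix them.
try:
    raw_utf8 + suffix
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Emulating a single string type: the undecoded bytes are treated as Latin-1
# code points (an assumption for this sketch), so each byte becomes its own
# character before the real text is appended.
nonsense = raw_utf8.decode("latin-1") + suffix

# Decoding the bytes with their actual encoding first gives the intended text.
intended = raw_utf8.decode("utf-8") + suffix
```

Here `nonsense` has three characters where `intended` has two, because the two UTF-8 bytes of "é" were each promoted to a separate character. Decoding the input before combining it with other text avoids this in either model.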