My unsophisticated Perl cribsheet
By Paul Flo Williams
For donkey’s years I have been developing web applications with Apache httpd, Perl, CGI and MySQL, because that has always been the default setup on my web host. I know I should be moving away from nearly all of these, with the exception of Perl, but that would involve me doing something funky with a new server environment, containers, droplets or, sigh, anything that gets kicked into next year’s resolutions. [1]
Unfortunately, since 1999 I have been tripped up every few years by a new release of one of the above layers that actually improves its Unicode support (yay!), while triggering problems somewhere else (boo.) I still remember how scared I was when I found out that some strings are internally Latin-1 and some are Unicode, and thinking that I needed to mess with internal flags to manipulate them. So very glad to have been proven wrong, there. [2]
Long story short, for my own benefit (and before I trigger another layer collapse), this is how I am currently tackling Unicode from bottom to top of my web environment, because I want poo emojis in my database as much as anyone else. This reflects my understanding of the setup that is currently working for me.
Perl
Since my scripts are written in Perl, this is the major part. I need all these parts to support Unicode:
- Command line arguments. I like testing web scripts from the command line, particularly because I like to see a JSON output for the big hashes that I normally pump through Template Toolkit in order to produce a web page.
- Database connection.
- Web response, which really means getting the correct encoding on stdout.
- JSON output. At the moment I use this for testing, but I do hope to get more sophisticated over time and have more AJAX-y pages or a working API for my applications.
The first lines of my Perl scripts are:
#!/usr/bin/perl -CAS
use v5.34;
use utf8;
Taken in order:
- perl -CAS says that the default file handles (input, output, error) are UTF-8 encoded, and that command line arguments are also UTF-8 encoded. Essentially, all my strings in Perl will contain Unicode characters, not octets, and serialising to/from UTF-8 happens at my interfaces. With these options, I no longer have to put binmode STDOUT, ':utf8' in my scripts.
- use v5.34; just keeps me up to date with the latest features I can use on my web host, which allows me to say say. I no longer have to say use strict, but I’d still need use warnings until I get to v5.36.
- use utf8; allows me to put Unicode characters directly in my Perl script, and that is all it does. I like to do this directly, only resorting to \N{...} when I can’t clearly see what the character is meant to be, which in my monospaced Vim environment means long dashes. I’ve never actually used a poo emoji in a script, though I’ve undoubtedly written many a program which could be judged that way.
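If the -CAS switches aren’t in effect (say, in an environment that ignores shebang switches), much the same thing can be done explicitly in code. This is a rough sketch of the equivalent, not what my scripts actually do:

```perl
#!/usr/bin/perl
use v5.34;
use utf8;
use open qw(:std :encoding(UTF-8));   # roughly -CS: UTF-8 layers on STDIN/STDOUT/STDERR
use Encode qw(decode);

# Roughly -CA: turn the raw octets of @ARGV into character strings.
@ARGV = map { decode('UTF-8', $_) } @ARGV;

# With character semantics, length() counts characters, not octets:
say length 'café';    # 4 characters (5 octets when UTF-8 encoded)
```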
CGI
Although I still use CGI.pm, it now only gets used for retrieving parameters and setting the content type of the response. I can either do:
use CGI ();
use Encode;
my $cgi = CGI->new;
my $p = decode('UTF-8', $cgi->param('q'));
to decode parameters myself, or
use CGI qw(-utf8);
my $cgi = CGI->new;
my $p = $cgi->param('q');
Clearly the latter is simpler. Either method does the appropriate deserialisation for me so that, again, I’m handling Unicode characters internally, not octets. As a bonus, for debugging, I can just output things to stderr or stdout (where it doesn’t interfere with the web response), and the UTF-8 serialisation will happen for me.
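Under the hood, the -utf8 pragma is doing essentially that same decode('UTF-8', ...) call for every parameter. A sketch of the difference it makes, simulating the raw octets a parameter arrives as (no CGI needed, just Encode; the value 'naïve' is a made-up stand-in):

```perl
use v5.34;
use utf8;
use Encode qw(decode encode);

# Simulate the octet string the web server hands over for a
# parameter whose (already percent-decoded) value is 'naïve':
my $raw = encode('UTF-8', 'naïve');

say length $raw;                    # 6: octets, the ï takes two bytes
say length decode('UTF-8', $raw);   # 5: characters, what I want internally
```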
MySQL
My web host (Pair Networks) uses MySQL 8.0, so my database text columns are created with a charset of utf8mb4 and a collation of utf8mb4_general_ci. Oddly enough, until two minutes ago, when I checked, I believed they were still using version 5.7. Clearly I’m not sophisticated enough to have tripped over problems during their transition.
My connection line looks like this:
my $dbh = DBI->connect($source, $user, $pass,
{ mysql_enable_utf8mb4 => 1 });
and that performs all the (de)serialisation I need. That MySQL option used to be mysql_enable_utf8, before characters got larger.
JSON output
JSON output is as simple as:
use JSON ();
my $json = JSON->new;
my $json_text = $json->encode($r);
The thing that tripped me up most recently was using JSON’s encode_json routine without noticing that it does the UTF-8 serialisation itself, which resulted in me double-encoding the output. I find that I have to read documentation very carefully in order to distinguish between interfaces (functions) that consume or produce UTF-8 octets and those that work with Unicode characters. I want Unicode internally, so that counting or splitting works as expected.
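The character/octet distinction is visible with the core JSON::PP module, which shares the interface of the JSON module used above (JSON::PP stands in here only because it ships with Perl):

```perl
use v5.34;
use utf8;
use JSON::PP ();   # core module; same interface as JSON

my $r = { note => 'café' };

# The OO encode method produces a string of *characters*;
# the UTF-8 serialisation happens later, at the output layer:
my $chars = JSON::PP->new->encode($r);

# encode_json produces UTF-8 *octets*; printing those through a
# UTF-8 output layer encodes them a second time (the trap I fell into):
my $octets = JSON::PP::encode_json($r);

say length $chars;    # 15 characters
say length $octets;   # 16 octets (the é takes two bytes)
```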
That’s a wrap
Now I’ve written it down and tested it, it looks simple again, but I was worried that I’d done something tragically wrong when I picked the wrong JSON routine and convinced myself that that was the correct part, rather than the rest of the components that had been working correctly up to that point. Sometimes I’m unsure enough of my understanding that I presume I’m more likely to have got two wrongs making a right than the clean stack.
1. Moving to Dancer2 is my top resolution for 2025, but I really need someone else to write that blog post that says “Get an account with so-and-so, run this super deployment script and copy your application here and Bob’s your uncle, and it’ll cost you 37p per month.”
2. With the maturing of the ‘unicode_strings’ feature, this too became an historic worry.