UTF8, MySQL, Perl and PHP

In my previous post I mentioned that Haddock Blogs was finally almost free of the character encoding errors that have plagued it for years.

In the interest of helping anyone else doing similar work in the future but also, more selfishly, in the hope peopl spot things I could do better, here’s a summary of what I’m now doing to keep things UTF8 from start to finish.

I’m the first to admit my method has been haphazard here, stumbling from one error to another, Googling for a hasty solution to heave me over and on to the next one. Follow my instructions at your own risk, and if you spot any errors, large or small, do post a comment or email me, as I’d love to know where I’ve gone wrong.

MySQL

First off, make sure the database is using UTF8.

mysql> STATUS;

This command will show you the “Db characterset”. If it’s not “utf8” then:

mysql> ALTER DATABASE database_name CHARACTER SET utf8;

Although the database is now using UTF8 your tables and columns might be set to something else. Check the table’s character set by doing this:

mysql> SHOW CREATE TABLE table_name;

This will show you more information about your table than a simple “DESCRIBE” will. The last part will show the table’s CHARSET. If it’s not “utf8”:

mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8;

I don’t know how to find out the encoding of individual columns, or how likely it is that they could still be set to something else, but this is how to convert them to UTF8 (in this case, a VARCHAR(255) column):

mysql> ALTER TABLE table_name MODIFY column_name VARCHAR(255) CHARACTER SET utf8;

To add to the fun, your connection to the database itself has its own encoding. We’ll see more of this later, but if you’re using MySQL on the command line, entering this command:

mysql> SET NAMES utf8;

will mean that your current session is in UTF8. If you’re using PHPMyAdmin, you might not have to worry about this(?). I expect your command line client also has its own encoding…

Perl

Haddock Blogs uses a Perl script to grab remote RSS/Atom feeds, put items into the database, and generate its own RSS feeds.

First thing is ensuring our connection to the databse is all UTF8 friendly:

# Standard DBI connection.
my $dbh = DBI->connect("DBI:mysql:${database}:${hostname}", $username, $password) 
    or die "DB Connection not made: $DBI::errstr";

$dbh->{'mysql_enable_utf8'} = 1;

$dbh->do('SET NAMES utf8');

The last of those is just as we did in the previous section, ensuring our connection is nicely UTF8. The ‘mysql_enable_utf8’ line is something that took a bit of Googling to find when I was having problems fetching UTF8 data from the database using DBI. This page suggests this is non standard, but all I know is I was relieved when I found something that solved a problem.

For fetching the RSS/Atom feeds I used XML::Feed and blindly trusted that it would do things as right as possible. It seems to have Just Worked without too much fiddling. The text strings it returns get put into the database pretty much as-is, with a minimal amount of modifications (removing HTML tags from brief summaries, stripping spaces at start and end, etc.).

However, I encountered problems when SELECTing existing data from the databse. I check to see if any recent blog posts have changed, compared to those in the database. I found that some posts were always “changed”: the text pulled from the database was different to the supposedly identical text found in the remote RSS feed.

Strings in Perl are internally marked with some kind of encoding information. I nearly understood this stuff a week or two back when I was reading up on it and have forgotten the details already. The solution, for me, was to do this to each of the strings that have been SELECTed from the database, before doing anything else with them:

use utf8;

# Do this for every string you fetch from the database:
utf8::decode( $row{'item_title'} ) unless utf8::is_utf8( $row{'item_title'} );

You can read a bit more about that where I read it. After doing this I was able to compare MySQL-originated strings with XML::Feed-originated strings accurately.

Haddock Blogs’ RSS feeds are created using XML::RSS. At the moment I’m creating the XML::RSS object like this:

my $rss = new XML::RSS (
    'version' => '1.0', 
    'encoding' => 'UTF-8',
    'encode_output' => 0
);

which seems to work OK. I do the same utf8::decode() rigmarole on text fetched from the database before its put into the RSS feed, and also replace all ‘&’s with ‘&’s.

PHP

The Haddock Blogs web pages are generated with PHP. The first thing to do is do the magic Setting of Names again:

$conn = mysql_connect(DB_HOST, DB_USER, DB_PASS);

if (!$conn) {
    print "<p>DB connection attempt failed.</p>";
    exit;
}

if (!mysql_select_db(DB_NAME, $conn)) {
    print "<p>DB select failed</p>";
    exit;
}

mysql_query('SET NAMES utf8', $conn);

The last line is the relevant one here, the previous ones provided just for context.

There’s not much else I’m doing here to ensure UTF8ness. The HTML has the appropriate content-type set:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Text is just fetched from the database and printed to screen.

Summary

Right now this seems to have pretty much done the job — last time I checked the RSS feeds were free of encoding errors. There’s still something odd with &#8220; style HTML entities from one particular feed showing up on the web page, and I’m not sure where that problem lies.

Again, if you have suggestions for improvements to this, or pointers to better explanations of similar processes, do let me know. Thanks.

Comments

  • Thanks for the information. Very helpful.

  • Perl 5.8.0-> Whew. Seems like using the ‘mysql_enable_utf8’ (i used inside the connect string) turns ON the utf8 flag so the utf8::decode function is never called because of the “unless is_utf8/valid” syntax or at least on 5.8.0. I was trying to do a “s/(\X);/exists $utf_entity{$1} ? $utf_entity{$1} : $1/eg;” (copied from http://linuxgazette.net/117/tag/4.html (chged \X+ to \X)) and was not working. It started working when I ALWAYS performed a utf8::decode and THEN IT WORKED!. (I assume it was in OCTETS and the utf8::decode converted to string)…. oh, what a pain but appears to be workings…. oh, I was trying to go FROM utf8 TO iso—8859-1 WITHOUT loosing the “does not map to iso-8859-1” chars so I’m having to MANUALLY change all unicode to latin…

  • Herb Chenault’s second/ongoing utf8 saga-> Upon ongoing utf8 research, seems like “mysql_enable_utf8” shouldn’t be doing anything because it is new in DBD::mysql version 4.03/4.04 (it may be in an earlier DBD version however the DBD::mysql notes say to use .03/.04 because it has some utf8 bug fixes that are important). If not using the latest DBD::mysql then seems like the “set names” and “utf8::decode” are the recommended practices.

    Thank you very much for your above research because it seems to be what everyone is using and recommended, at least until you/we upgrade to DBD::mysql 4.03/04… (it is noted that I didn’t use the “is_valid/is_utf8 and just utf8::decoded all….)

  • Thankyou, thankyou, thankyou, thankyou, thankyou, thankyou.

    I have literally spent the last 4 days working on this with a submission date looming ever closer and failure certain until I get it sorted. I found a few discussions on this but the magic “use names utf8” is what I was missing. it’s now working and I don’t have to look for a new job.

    I can’t thank you enough.

  • Nice one, many thanks for this post. I’ve been trying to work out how to get weird characters in people’s Trackmania Nations player names to show on a web page, then after discovering it needed to be done in UTF-8, trying to work that one out :) Cheers.

  • I just want to add to the thanks for putting such a clear explanation online.

    The site in question was this (see URL below) list of University Research Repositories (websites holding research papers), where the names of many Universities around the world did not display at all correctly). The data is collected from other sites on the web.

    http://www.nostuff.org/ircount/

    I first few attempts to find help via Google did not find anything (clealry not doing the right searches!), but today I found your page quickly (google search ‘perl, mysql, utf’).

    As an aside, the thing that confused me was that as well as writing the data to mysql, the perl script also dumped it to a text file for safe keeping, and oddly this also did not store the names correctly, so for a while I presumed the error was with LWP::Simple getting the data from across the web, and not the writing to mysql. Sent me down the wrong path.

    Thanks again

    Chris

  • Expanding the example for inserting strings from multiple feeds (and different encodings) in the database:

    use Encode;
    use Encode::Guess
    my $utf8 = decode("Guess", $data) unless Encode::is_utf8($row{'item_title'});
    # db stuff here
    

    All works fine now :). Thanks a lot for this guide!!!

  • It seems that if your perl module doesn’t understand utf8, the utf8::decode line will be ineffective, and the data comes raw from the DB.

    In this case you may want to apply utf8::encode($myrow{‘item_x’}); to get utf8 strings that you can then print and use.

  • Thanks a lot for sharing this information.
    There was a gap of behavior between redhat and ubuntu - you fixed it :-)
    And happy new year!

  • You very probably shouldn’t use is_utf8 , _utf8_on or _utf8_off at all.
    From: http://perldoc.perl.org/perlunifaq.html#What-is-%22the-UTF8-flag%22?

    Don’t try to find out what the internal encoding for a certain string is, but instead just encode it into the encoding that you want.
    From: http://perldoc.perl.org/perlunifaq.html#I-lost-track;-what-encoding-is-the-internal-format-really?

    So just apply utf8::decode unconditionally (no is_utf8()).

  • How to see character set for individual colums;
    When I run
    mysql> SHOW CREATE TABLE table_name;
    It returns information on the character set for the individual columns, as in:

    `gene1` varchar(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
    `gene2` varchar(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,

    This is on an Ubuntu desktop.

25 Apr 2008 at Twitter

  • 07:03pm: Stage Combat done. Waiting to meet Mary and friends for dinner in the City. Drunk men singing Spandau Ballet's 'Gold'.

25 Apr 2008 in Writing

Haddock Blogs changes
A summary of the recent changes to the Haddock Blogs site and feeds.

On this day I was reading