With some help from Scott (thanks!), who has actually used things like MySQL before, I took the scary step of moving this site from Latin-1 encoding to UTF-8. It was about time!
When I first set up Movable Type some years back I didn’t worry about the encoding setup too much. I think it defaulted to Latin-1 and that’s what I went with. While I would have preferred some Unicode encoding at that stage, it wasn’t clear to me whether support for it was good enough in all browsers, so to be on the safe side I simply converted everything non-ASCII I typed to HTML character entities before posting. Thanks to UnicodeChecker’s conversion services that’s just two keystrokes away at any time.
Some years passed and I became rather confident about using Unicode encodings everywhere. Comments started appearing on the site, from people who didn’t (and shouldn’t be expected to) convert what they typed to ASCII HTML before posting. This meant that comments using characters outside the Latin-1 range have always been somewhat problematic. But the problem only rarely became visible.
More time passed and I eventually switched from Berkeley DB to MySQL for storing the site data. I actually hoped I could ‘upgrade’ the encoding in that process. But that option doesn’t seem to exist at all in Movable Type, and the whole conversion was painful enough as it was, so I ended up having a shiny new database with text in Latin-1 encoding to go with a smelly new version of Movable Type which defaulted to UTF-8 encoding. So I grudgingly had to re-adjust that to use Latin-1.
Things became even worse when I added the delicious feed to the right sidebar. As delicious serves text in UTF-8 encoding and I didn’t know a simple way to convert that to Latin-1 on-the-fly, my pages would frequently not validate for encoding reasons alone and – even worse – the delicious list would sport words that looked broken.
So I really wanted to switch to UTF-8 everywhere so I could forget about those problems. As I know pretty much nothing about MySQL and am somewhat scared by the fact that all my data live on the server – where I’d probably have an easy time breaking everything and a hard time fixing it again – I didn’t feel like experimenting and asked for help. Scott was good enough to reply quickly and let me know that I’d have to change the mystical ‘collation’ settings for the database and convert the data.
Not exactly what I wanted to hear (I had hoped for some magic hidden button in the ugly phpMyAdmin site or some magic SQL command that’d just sort everything out for me) but at least a well-defined plan. Changing the collation settings was easy. So all that remained to do was to convert the text. As I wouldn’t know how to write a program that traverses all tables of the database and converts the encoding automatically, and I figured that most of my text fields would be ASCII only anyway, I decided to do the conversion manually, as that’d probably be quicker than finding out how to write such a program, writing it and getting it to work without destroying my database.
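For the curious: in plain SQL the collation change amounts to something like the following. This is just a sketch – the database and table names are hypothetical (adjust them to your own schema), and it’s not what I actually clicked through in phpMyAdmin:

```sql
-- Hypothetical database/table names; adjust to your schema.
-- Set the default character set for new tables in the database:
ALTER DATABASE mt_blog CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Convert an existing table's declared character set
-- (this also converts the stored text data in MySQL 4.1 and later):
ALTER TABLE mt_entry CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```

Had I known about the `CONVERT TO CHARACTER SET` form earlier, it might have saved some of the manual work – but I can’t vouch for how it interacts with a Movable Type database.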
Of course the number of text fields containing non-ASCII characters was a bit higher than I anticipated (a few hundred) – thanks to my own sloppiness when editing stuff and forgetting to use UnicodeChecker’s conversion, to people typing ‘smart’ quotation marks, and to people with non-ASCII characters in their names leaving comments. But I just went through all of those, simply converting them to HTML character entities with UnicodeChecker as well. Mind-numbing, for sure, but simple.
The trickiest thing about this step was actually finding the affected records. After a bit of looking I figured I’d have to use regular expressions for this one and found that the trick is done by this cryptic mess:
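The original query seems to have gone missing here, but the kind of cryptic mess I mean is a `REGEXP` match against everything outside the printable ASCII range – roughly like this sketch, with hypothetical Movable Type table and column names:

```sql
-- Find records containing characters outside printable ASCII
-- (' ' = 0x20 through '~' = 0x7E), while tolerating tabs and line breaks.
-- Table and column names are hypothetical; adjust to your schema.
SELECT comment_id, comment_author
FROM mt_comment
WHERE comment_text REGEXP '[^ -~\t\r\n]'
   OR comment_author REGEXP '[^ -~\t\r\n]';
```

Running that for each text column in turn gives a list of records to fix by hand.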
The site is served in UTF-8 now, it validates again and doesn’t look broken. In case you run into places where encoding problems are apparent, please let me know.