Quarter Life Crisis

The world according to Sven-S. Porst

« June FilmsMainssp vs. MySQL »

MySQL vs. UTF-8

787 words

As promised on Sunday, here come some gory details of me moving the blog database from one server to the other. Both servers are running the same database, MySQL, which wants to be known as ‘The world’s most popular open source database’.

My naïve assumption was the following: Running the same database on both sides of the data migration means that I can just export data on the one side and then import it again on the other side. Unfortunately that didn’t work at all. The text encodings came out wrong in all cases.

It’s hard to tell what exactly went wrong there, MovableType may have set a solid basis for things going wrong back when I first populated the database by upgrading from BerkeleyDB. I doubt I’ll ever know and I’m fairly sure I hardly care. I’m not a database developer, and if I were, I’d be ashamed.

Now, over here at earthlingsoft we are known to be Unicode aficionados, we have seen an encoding problem or two, we tend to recognise these problems and quite often we know an easy way to solve them. But this one turned out to be worse than all the others I had seen before.

The main ingredient to the problem is the UTF-8 encoding. While UTF-8 has the disadvantage of being non-trivial and increasing the byte count of non-Roman texts compared to UTF-16, it has the advantage of being backwards compatible to ASCII and being immune to byte order problems. As soon as you read texts in a language that uses non-ASCII characters (i.e. pretty much any language other than English or Latin), it’s quite likely that you have seen the typical UTF-8 encoding problem on the web already. The problem arises when text is stored in UTF-8 format but later served and thus interpreted in an old-school encoding - most likely ISO-Latin-1 or Windows-CP1252 on the web these days or Mac Roman on Macs.

In practice this problem frequently means that you see à characters amidst junk. For example, what should be an ‘ö’ will appear as ‘ö’. What happens there is that the ‘ö’ is encoded as the byte with (hexadecimal) value C3 followed by the byte with value B6. And in ISO-Latin-1 encoding C3 is decoded as ‘Ã’ while B6 is decoded as ‘¶’. So there you go.

But that’s far from the whole story. It simplifies things by assuming that you misinterpret the bytes you get as ISO-Latin-1. Another popular misinterpretation is Windows-CP1252 encoding. Which is the same as ISO-Latin-1 in many places, but not in all of them. So it’s not always easy to spot in which way things were misinterpreted. Mac Roman encoding may be easier to spot, it has ‘√’ in position C3, so seeing a root sign makes it likely that you’re dealing with Mac Roman. The following table gives a few examples:

ß Ù
Unicode DF D9 20AC 2603
UTF-8 C39F C399 E282AC E29883
ISO-Latin-1 * à à ⬠â
Windows-CP1252 ß Ù € ☃
Mac Roman ß Ù € ☃

Of course there are many more encodings - Eastern European ones, or even Asian ones - which open many more modes of failure. ISO-Latin-1, Windows-CP1252 and Mac Roman just happen to be the ones I saw the most because I am writing in a Roman script.

Now that’s the main mode of thinking. But Sunday’s post mentioned the string ‘√ɬº’ which you cannot fix in that way. In particular as the character it encodes was supposed to be an ‘ü’ (Unicode U+FC in NFC and U+75 U+308 in NFD, none of which yield the characters in question).

However, once you’re in the mood for this kind of thing, you see the √ glyph and think ‘Mac Roman’, so you decode that and get ‘Ã ¼’ which in turn triggers the ‘ISO-Latin-1’ reflex and indeed is a wrong representation of ‘ü’. So there we are: UTF-8 wrongly encoded twice. First as ISO-Latin-1 (or Windows-CP1252), then read again and wrongly encoded once more as Mac Roman.

As soon as I realised that I figured that I better check whether I do any implicit encoding or decoding in the process as I am the Mac user, thus the only one who could have software that considers using Mac Roman encoding a good idea, and both other machines were some flavour of Linux.

As I’ll explain tomorrow that was the case to a certain degre. But unfortunately that wasn’t what broke things and neither could the Mac text editor really be blamed for it. I’ll continue tomorrow, give further details of the problem and sketch the solution I found. It’s not going to be pretty.

July 8, 2008, 0:14

Tagged as cp1252, encoding, latin-1, Mac Roman, mysql, unicode, utf-8.

Comments

Comment by G: User icon

When I had to migrate over a database between servers I discovered this problem. I didn’t bother to fix it, for a couple of reasons. However, some people did tackle it over here. I hope your solution was easier.

July 8, 2008, 8:12

Comment by ssp: User icon

As I didn’t have shell access to the old server, I couldn’t try those options, so my way ended up being a bit more involved (see the following post).

July 9, 2008, 12:38

Add your comment

« June FilmsMainssp vs. MySQL »

Comments on

Photos

Categories

Me

This page

Out & About

pinboard Links

♪♬♪

Received data seems to be invalid. The wanted file does probably not exist or the guys at last.fm changed something.

People

Ego-Linking