Quarter Life Crisis

The world according to Sven-S. Porst

« MySQL vs. UTF-8MainMovable Type »

ssp vs. MySQL

1156 words

Yesterday I outlined the main problem I had when moving my MySQL database from one machine to another in a file exported by the database itself: The text encodings were completely broken. The file I saw started as text encoded in UTF-8 encoding, was then misinterpreted as ISO-Latin-1 encoding and then written as UTF-8 encoding again.

When I opened the file again it was misinterpreted as a file in Mac Roman encoding, meaning that I ended up getting four odd characters for something as harmless as an ‘ö’. Now let’s look at this problem from the back. Why the heck was the file opened as Mac Roman rather than UTF-8 when I dragged it to a text editor like TextEdit or SubEthaEdit? Explicitly using the ‘Open…’ command in TextEdit and selecting UTF-8 encoding for the file gave the answer: The file wasn’t in UTF-8 format. And hence, when opening the file without explicit specification of the encoding the text editor took the next best guess… which on the Mac appears to Mac Roman.

That explained what I saw. A line-by-line analysis of the file gave that just a single line in the file caused that problem (fields about plugins), so I figured that wouldn’t matter much anyway. Two resolutions seemed workable for this layer of the problem: (a) read the file in an accepting encoding like MacRoman, split it into lines, drop the line whose bytes aren’t UTF-8, save the file, re-open it as UTF-8. Another way was to import the database dump into the other MySQL server, setting up an old encoding while doing that and then exporting the file as UTF-8 again.

With the first layer of the problem resolved, the remaining one could be tackled. It seemed reasonable to hope that one could just repeat these steps and be happy ever after. Unfortunately we were far beyond happy (or reasonable) at this stage and that just didn’t work. Hence an even closer look was needed. After a bit of trying it appeared as if some of the wrongly encoded characters were using Windows-CP1252 encoding while others were using ISO-Latin-1 encoding.

That doesn’t make a lot of sense and I didn’t check the details to which extent this could be explained by the undefined areas of the respective encodings and/or quirks in Mac OS X’s text encodings - even I have to pick my battles sometimes.

In case you run into this kind of problem, the following code (X.5 Foundation) may give you a head start once you have a UTF-8 file. I’m pretty sure that it may require modifications for other encoding problems but it helped me a lot. It goes through the text character by character, spots the non-ASCII characters, checks whether they might have been mis-encoded with ISO-Latin-1 or Windows-CP1252 encoding and then heuristically creates an output file with what appears to be the best fixed version we could figure out. Each character and its conversions are logged, so you have an idea about which places you have to look at.

#import <Foundation/Foundation.h>


int main (int argc, const char * argv[]) {

NSData * d = [NSData dataWithContentsOfFile:@"/Users/ssp/Desktop/web28db1.sql"];


NSString * s = [[NSString alloc] initWithData:d encoding:NSUTF8StringEncoding];

NSArray * a = [s componentsSeparatedByString:@"\n"];

NSMutableArray * b = [a mutableCopy];

a = nil;

NSMutableArray * b2 = [NSMutableArray arrayWithCapacity:b.count];


int i = 0;

for (NSString * kl in b) 

{

NSMutableString * goodString = [NSMutableString stringWithCapacity:kl.length];

BOOL chars = YES;

if (chars) {

NSString * utf8string = kl;

for (int j = 0; j < utf8string.length; j++) {

if ([utf8string characterAtIndex:j] > 127) {

NSString * precontext = [utf8string substringWithRange:NSMakeRange(MAX(0,j-15), 15)];

NSString * wrongstring = @"";

while ([utf8string characterAtIndex:j] > 127) {

wrongstring =  [wrongstring stringByAppendingString:[utf8string substringWithRange:NSMakeRange(j, 1)]];

j++;

}

j--;

NSString * postcontext = [utf8string substringWithRange:NSMakeRange(j+1, MIN(15, utf8string.length - j - 1))];

const char * tempISOstring = [wrongstring cStringUsingEncoding:NSISOLatin1StringEncoding];

NSString * correctedISOstring = nil;

if (tempISOstring) {

correctedISOstring = [[NSString alloc] initWithBytes:tempISOstring length:strlen(tempISOstring) encoding:NSUTF8StringEncoding];

}

const char * tempCPstring = [wrongstring cStringUsingEncoding:NSWindowsCP1252StringEncoding];

NSString * correctedCPstring = nil;

if (tempCPstring) {

correctedCPstring = [[NSString alloc] initWithBytes:tempCPstring length:strlen(tempCPstring) encoding:NSUTF8StringEncoding];

}

if ([correctedCPstring isEqualToString: correctedISOstring]) {

NSLog(@"Line %i, char %i OK  → %@%@%@", [b indexOfObject:kl], j, precontext, correctedISOstring, postcontext);

[goodString appendString:correctedISOstring];

}

else {

if (!(correctedCPstring || correctedISOstring)) {

NSLog(@"*** Line %i, char %i ***FAIL*** %@__%@__%@", [b indexOfObject:kl], j, precontext, wrongstring, postcontext);

[goodString appendString:wrongstring];

}

else if (!correctedISOstring) {

NSLog(@"Line %i, char %i CP1252 → %@__%@__%@ (Latin1 is nil)", [b indexOfObject:kl], j, precontext, correctedCPstring, postcontext);

[goodString appendString:correctedCPstring];

}

else if (!correctedCPstring) {

NSLog(@"Line %i, char %i LATIN1 → %@__%@__%@ (CP1252 is nil)", [b indexOfObject:kl], j, precontext, correctedISOstring, postcontext);

[goodString appendString:correctedISOstring];

}

else {

NSLog(@"*** Line %i, char %i ***DIFFER*** %@__%@__%@ (Latin1) vs %@__%@__%@ (CP1252)", [b indexOfObject:kl], j, precontext, correctedISOstring, postcontext, precontext, correctedCPstring, postcontext);

[goodString appendString:correctedCPstring];

}

}

}

else {

// no correction necessary, just add the string

[goodString appendString:[utf8string substringWithRange:NSMakeRange(j, 1)]];

}

}

} else {

NSLog(@"Problem with line %i --- DUMPED", [b indexOfObject:kl]);

}

i++;

[b2 addObject:goodString];

}

NSString * s3 = [b2 componentsJoinedByString:@"\n"];

[s3 writeToFile:@"/tmp/fixeddatabase.sql" atomically:NO encoding:NSUTF8StringEncoding error:NULL];

return 0;

}

[Download as XCode 3 Project]

Running the code gave that more than 4000 characters in my database needed fixing and unfortunately almost a hundred of them were problematic and couldn’t be converted by the code. Luckily most of them were closing quotes ” (which are E2 80 9D in UTF-8 with 9D being a control character in both Windows-1252 and ISO-Latin-1 which probably makes the conversion fail) and a number of other characters with similar problems required manual adjustment in the end.

I’d say this is far from the best solution possible. But then again having databases which fuck up your text is where being far from the best solution possible began. I’ll just hope that this doesn’t happen again. Haha, that joke will be on me… 

July 9, 2008, 0:24

Tagged as cocoa, code, cp1252, encoding, foundation, mac roman, mysql, utf-8.

Comments

Comment by d.w.: User icon

Wow. Another vote for blog posts in text files — at least when I %*# up my encodings there, I know I’m at fault and how to fix things — with a regex (and/or UnicodeChecker ;) ), as opposed to a great scary bunch of C.

July 9, 2008, 17:31

Comment by ssp: User icon

Yeah, Blosxom is very charming for its simplicity.

Can you judge how resource consuming it is both for your maintanance and the machine running it once the data in it grows?

I’m in the process of bitching about Movable Type’s resource hogging, so I’m wondering how other engines do. How much RAM and CPU time should publishing a page cost? How much should accepting a comment spam cost?

July 10, 2008, 0:36

Comment by d.w.: User icon

I’ve never benchmarked or profiled it, though I seem to recall someone did. My recollection is that, in dynamic mode, it was a little “piggy” until I installed the entriescache plugin, at which point the performance became such that I could play videogames on the machine serving the blog without dropping frames.

July 10, 2008, 18:03

Add your comment

« MySQL vs. UTF-8MainMovable Type »

Comments on

Photos

Categories

Me

This page

Out & About

pinboard Links

People

Ego-Linking