Quarter Life Crisis

The world according to Sven-S. Porst

UnicodeChecker 1.10

Hooray for UnicodeChecker. It has finally reached the magic 1.10 version number! The changes to the application are non-trivial but may be hard to see.

The most apparent change to UnicodeChecker is that it (and naturally all of its plugins) is now a universal binary. This is what the early adopters have been waiting for. I had expected UnicodeChecker to be very hard to universalise. With the Unicode data files to parse and a number of algorithms for handling multi-byte character encodings in the application, the chance of running into some Endianness problems looked pretty good.

And for sure, such issues did exist and needed to be solved. But Steffen said they were manageable rather than hopeless and thus the Universal version came to exist. Increase of file size here we come… A further increase of the file size is caused by the Growl framework now being shipped with UnicodeChecker. While personally I really dislike Growl, it seems to be quite popular.

And now for the really good stuff. For some time now, UnicodeChecker has been capable of reading the Unihan data, which is around 20MB of information about Asian characters. UnicodeChecker also uses the ‘kDefinition’ data in its find sheet. That’s quite cool if, say, you’re looking for the Chinese characters for some word. You type in ‘fish’ and you’ll get not only all characters whose Unicode name contains ‘fish’ but also all characters whose definition contains ‘fish’. There were two problems with that feature, though. One was that some search terms gave very long (and thus not very helpful) lists of results, which may have been annoying if you weren’t looking for the meaning of Asian characters to begin with. The other was speed: the live filtering stopped being responsive once the Unihan data was searched after every keystroke. A new check box now lets you decide whether you want the deeper search or the better performance.

[Screenshot: the ‘Include Unihan Definition’ checkbox in UnicodeChecker's find sheet.]
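To make the trade-off behind that check box a bit more concrete, here is a rough sketch in Python (not AppleScript, and certainly not UnicodeChecker's actual code). It assumes a tiny made-up `SAMPLE_DEFINITIONS` table standing in for the Unihan ‘kDefinition’ data, and the function name `find_code_points` is hypothetical as well. The name search only needs the character names, while the deeper search additionally has to look at a definition for every candidate.

```python
import unicodedata

# Made-up sample entries standing in for Unihan's kDefinition field.
# The real data file is far larger and not part of Python's standard library.
SAMPLE_DEFINITIONS = {
    0x9B5A: "fish",
    0x6F01: "to fish; seize; pursue",
}


def find_code_points(term, include_definitions=False, limit=0x10000):
    """Return code points below `limit` whose name (or definition) contains `term`."""
    needle_upper = term.upper()
    needle_lower = term.lower()
    hits = []
    for cp in range(limit):
        # Unnamed code points yield the empty string instead of raising.
        name = unicodedata.name(chr(cp), "")
        if needle_upper in name:
            hits.append(cp)
        elif include_definitions and needle_lower in SAMPLE_DEFINITIONS.get(cp, "").lower():
            hits.append(cp)
    return hits


shallow = find_code_points("fish")                         # names only
deep = find_code_points("fish", include_definitions=True)  # names and definitions
```

With the sample table above, `deep` additionally contains the two ideographs, whose Unicode names are just ‘CJK UNIFIED IDEOGRAPH-…’ and thus never match a name search.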

Finally, and most importantly, UnicodeChecker’s AppleScript capabilities have been extended dramatically. We now have a class for code points which lets you query most of the information displayed for code points in UnicodeChecker’s main window from AppleScript. You can even access the Unihan data from AppleScript now.

That said, there are a number of caveats about these new AppleScript features. It’s advisable to have a look at UnicodeChecker’s help book for more information on the topic. Most notably you can in principle send UnicodeChecker an AppleScript command like:

get every code point whose unicode name contains "fish"

But before doing that it’s worth keeping in mind that there are more than a million code points and that Cocoa’s default way of evaluating such ‘whose’ queries is to cycle through all of them and run the necessary checks on each one, which is very slow. So if you want to see excessive memory usage by UnicodeChecker and to learn about AppleScript timeouts, it may be worth running such commands…

In other cases, when slightly less generality is required, you can significantly speed things up by restricting the lookup, to plane 0 of Unicode, say. Using

tell plane id 0
   get every code point whose unicode name contains "fish"
end tell

instead of the command above will be significantly faster.
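The arithmetic behind that speed-up is easy to sketch. The following is Python rather than AppleScript and only mimics the brute-force cycling described above. The helper names `plane_range` and `names_containing` are my own, not anything from UnicodeChecker or Cocoa.

```python
import unicodedata

PLANE_SIZE = 0x10000  # each Unicode plane holds 65,536 code points
PLANE_COUNT = 17      # planes 0 through 16, i.e. U+0000 to U+10FFFF


def plane_range(plane):
    """All code points belonging to the given plane."""
    start = plane * PLANE_SIZE
    return range(start, start + PLANE_SIZE)


def names_containing(term, code_points):
    """Brute force: look at every code point and test its name."""
    needle = term.upper()
    return [cp for cp in code_points if needle in unicodedata.name(chr(cp), "")]


everything = range(PLANE_COUNT * PLANE_SIZE)
plane0_hits = names_containing("fish", plane_range(0))
print(len(plane_range(0)), "of", len(everything), "code points examined")
```

Restricting the loop to a single plane means touching one seventeenth of the code points, at the price of missing matches that live in the other planes.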

In principle you, as a programmer, should be able to catch an AppleScript command coming to your application, analyse the request yourself and provide a more efficient algorithm than the dumb cycling through all objects provided by Cocoa. But code examples for that seem to be very rare or non-existent. (Be sure to send in everything you know about!) And judging from the usually abysmal performance of Apple’s own applications in AppleScript, I doubt that even Apple make significant use of that technique.
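Just to illustrate what ‘a more efficient algorithm’ could mean once you have caught the request, here is a sketch in Python of one possible approach: an index from words in character names to code points, built once instead of scanning on every query. This is purely an assumption about what one might do, not how UnicodeChecker or Cocoa scripting works, and `build_name_index` and `code_points_with_word` are hypothetical names.

```python
import unicodedata
from collections import defaultdict


def build_name_index(code_points):
    """Map each word of a character name to the set of code points using it."""
    index = defaultdict(set)
    for cp in code_points:
        for word in unicodedata.name(chr(cp), "").split():
            index[word].add(cp)
    return index


# Built once up front (plane 0 only, to keep the sketch quick).
NAME_INDEX = build_name_index(range(0x10000))


def code_points_with_word(word):
    """Answer a query with a single dictionary lookup instead of a full scan."""
    return sorted(NAME_INDEX.get(word.upper(), ()))
```

Note that this changes the semantics slightly: the index matches whole words only, so it is not a drop-in replacement for a ‘contains’ test on substrings.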

Finally let me mention another embarrassing oddity of the new AppleScript features. Unicode code points are numbered starting from 0 while AppleScript’s natural way of numbering things starts at 1. And thus the little snowman ☃ (U+2603, which is 9731 in decimal) can be accessed as ‘code point id 9731’ or ‘code point 9732’. In other words: Make sure that you don’t forget to write the ‘id’ in there. This is the kind of bad feature you’d want to avoid, but there seemed to be no good way to do that.
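The off-by-one can be spelled out in a few lines of Python (again just an illustration, with the helper names `one_based_index` and `id_from_index` made up for the purpose):

```python
import unicodedata

SNOWMAN_ID = 0x2603  # the code point itself, 9731 in decimal


def one_based_index(code_point_id):
    """Position of a code point in a list that starts counting at 1."""
    return code_point_id + 1


def id_from_index(index):
    """Code point found at a given 1-based position."""
    return index - 1


# With the id you get the snowman; forgetting 'id' and using the same
# number as a 1-based index lands on the code point just before it.
with_id = unicodedata.name(chr(SNOWMAN_ID))
without_id = unicodedata.name(chr(id_from_index(SNOWMAN_ID)))
print(with_id, "versus", without_id)
```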

Unfortunately AppleScript and Unicode aren’t the best partners. AppleScript has a number of different string / text classes which behave differently. Even worse, some of its UTF-16 related features suffer from what I’d euphemistically call ‘Endian-indifference’. So a bit of care will have to be taken when trying to do real Unicode work in AppleScript that is supposed to run on both PowerPC and Intel Macs. And if you want to use UnicodeChecker for that, be sure to have a look at the help book first, which contains a number of potentially helpful remarks and an example script.
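For those who haven’t been bitten by this yet, here is what byte order does to UTF-16 data, sketched in Python since that makes the bytes easy to see. It only demonstrates the general problem and says nothing about which AppleScript features are affected.

```python
snowman = "\u2603"

big_endian = snowman.encode("utf-16-be")     # bytes 26 03
little_endian = snowman.encode("utf-16-le")  # bytes 03 26

# Decoding with the wrong byte order silently yields another character
# (here U+0326) rather than an error.
misread = big_endian.decode("utf-16-le")

# A byte order mark (U+FEFF) at the front lets a decoder pick the right order.
with_bom = b"\xfe\xff" + big_endian
recovered = with_bom.decode("utf-16")
```

The nasty part is the middle step: nothing fails, you just end up with a different character, which is why such bugs can go unnoticed until the same script runs on the other architecture.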

Otherwise just download the current version, lean back and enjoy a crappy sample script that has to be stopped by hitting Escape in Script Editor.

May 22, 2006, 20:46

Tagged as earthlingsoft, UnicodeChecker.
