Dave has a mysterious post up. I am not so much concerned with its content as with its form (although said content is amusing, interesting and very mildly related to what I want to write about). Basically, its URL contains an accented character, é. I found that remarkable as I hadn't seen it before and like to play it safe in the whole encoding department.
My first guess was that this was some nice trickery by Safari to make URLs look better by replacing the 'percent escaped' characters with something prettier; after all, it also does IDNA these days. But my guess was wrong: Dave's source also contained the accented character. Being a bit curious, I got out good old UnicodeChecker to see how all this plays with the standard 'percent escaped' URLs.
Dave's URL ends in détente.html. After conversion to the escaped URL this becomes de%CC%81tente.html. Note how this is the decomposed form of the accented letter: a plain e followed by U+0301, the combining acute accent ( ́ ), which is 0xCC81 in UTF-8.
Upon entering the converted URL into Safari, Safari will load the page correctly and even display the 'proper' accented letter in its URL bar. But things won't work this smoothly all the way through. Suppose you actually want to type the accented character. Using my keyboard layout this won't give me the combining sequence we've seen above, but rather 'latin small letter e with acute', which resides at Unicode position U+00E9 and is represented as 0xC3A9 in UTF-8. Hence what you entered would be escaped as d%C3%A9tente.html, i.e. it would not match what we had before.
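To make the mismatch concrete, here is a minimal Python sketch that percent-escapes both spellings of the file name. It only uses the standard library; the variable names are mine and nothing here is taken from Safari or UnicodeChecker.

```python
import unicodedata
from urllib.parse import quote

# The name as typed on a keyboard that produces precomposed U+00E9.
name = "d\u00e9tente.html"

composed = unicodedata.normalize("NFC", name)    # e-acute as one code point
decomposed = unicodedata.normalize("NFD", name)  # e followed by U+0301

# quote() percent-escapes the UTF-8 bytes of each non-ASCII character.
composed_url = quote(composed)
decomposed_url = quote(decomposed)

print(composed_url)                    # d%C3%A9tente.html
print(decomposed_url)                  # de%CC%81tente.html
print(composed_url == decomposed_url)  # False
```

The two strings look identical on screen but differ byte for byte, so a server that compares them literally will treat them as two different resources.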
To cut a long story short, the page will fail to load because of this. The explanation is that Dave's server runs on Mac OS X on an HFS+ volume (well, I extrapolated the former from his site, and the latter from this phenomenon), which stores file names as decomposed Unicode rather than the precomposed accented characters that seem to be more common otherwise.
Now that we have an explanation in place, is this good? I tend to think it's not. I can't enter the URL manually as it is now, although it is a perfectly typeable URL (also on English keyboards). I am not sure who is to blame for this: Safari? The server software? Do URLs have to be standardised further? Or are they already standardised and just not implemented correctly in one of the pieces of software involved?
Also enjoy the confusion already documented by Dave.
As soon as I saw Sam’s post I knew following the various screwups as it was digested by various buggy pieces of software across the internet would be highly amusing. I chose to amplify the fun by deliberately choosing an “interesting” permalink. Glad to see it’s providing entertainment to someone else. :)
There is an emerging standard for this. It’s called IRIs (Internationalized Resource Identifiers). You can find the latest version here. It is mostly finished, and has been handed over to the IESG.
According to the IRI spec, servers should expose IRIs in NFC (composed) rather than in NFD (decomposed). So in the above example, the server is the piece that should be fixed.
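As a rough illustration of what exposing NFC could look like on the server side, the sketch below turns a file name as stored on disk into the path that would be published in links. The function name `public_path` is hypothetical, and this is only one possible reading of the advice, not code from the IRI draft.

```python
import unicodedata
from urllib.parse import quote

def public_path(fs_name: str) -> str:
    """Return the percent-escaped, NFC-normalised form of a stored file name.

    Hypothetical helper: the idea is that links are always published in
    composed form, whatever the file system stores internally.
    """
    composed = unicodedata.normalize("NFC", fs_name)
    return quote(composed)

# A name as a decomposing file system would hand it back: e + U+0301.
stored = "de\u0301tente.html"
print(public_path(stored))  # d%C3%A9tente.html
```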
Let me just check whether I understood correctly: Dave’s server software (Blosxom, which is sort-of file system based) should accept composed strings and map them to the decomposed form before looking for the corresponding file on the file system.
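In code, that mapping might look like the following minimal sketch. The function name `fs_name_for_request` is made up for illustration, and it assumes that standard NFD is close enough to what the file system stores; as far as I know HFS+ uses its own variant of the decomposition, so treat this as an approximation.

```python
import unicodedata
from urllib.parse import unquote

def fs_name_for_request(request_path: str) -> str:
    """Map a percent-escaped request path to a decomposed file name.

    Hypothetical helper: decodes the escapes as UTF-8 and applies NFD,
    assuming that matches the form the file system stores.
    """
    decoded = unquote(request_path)  # unquote() decodes as UTF-8 by default
    return unicodedata.normalize("NFD", decoded)

# Both spellings of the URL end up pointing at the same stored name.
typed = fs_name_for_request("d%C3%A9tente.html")
linked = fs_name_for_request("de%CC%81tente.html")
print(typed == linked)  # True
```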
As I pointed out in the post, storing file names in the decomposed form is a feature of the HFS+ file system. And in fact, file names containing the non-decomposed form should be illegal on such file systems.
A quick check shows that Apache handles the situation all right: requesting a URL in either form gives access to the correct file.
Now I wonder, at which level do things go wrong? Is it the Blosxom software (which, I suppose, should be able to do its job without knowledge of the file system used)? Is it Perl (which Blosxom seems to be written in)? Or is the problem even lower, in the system itself?
Finally, if IRIs are to become a standard, how long do you think it will take for them to be actually usable? As for IDNA, it seems to have taken a while until even a few browsers had support out of the box.
I’m sure this only makes analysis even more complicated, but my blog URLs are actually Apache ScriptAliases, so a URL like “http://www.freeke.org/ffg/foo/bar.html” is translated internally to “http://www.freeke.org/cgi-bin/blosxom.cgi?path=/foo/bar.html”
I don’t read Perl well enough to completely understand what Blosxom’s doing with the path it’s being handed (by mod_alias by way of mod_cgi?), but when it comes to actually finding and retrieving the posts in the filesystem, it strictly relies upon Perl’s file handling routines.
Ouch, complicated. All the Perl I’ve done was a few lines a couple of years ago. If you know enough Perl to open a file, why don’t you try to open the file in question, once with the composed accented character and once with the decomposed one? I’d be interested to see the result.
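The experiment is easy to sketch. The version below is in Python rather than Perl, purely for illustration, and the helper name `try_both_forms` is made up. It tries to open one file name in both its composed and decomposed spelling and reports which attempts succeed; the outcome depends on the file system it is run on, so no expected result is given.

```python
import os
import tempfile
import unicodedata

def try_both_forms(directory: str, name: str) -> dict:
    """Report whether `name` can be opened in its NFC and NFD spellings."""
    results = {}
    for form in ("NFC", "NFD"):
        candidate = os.path.join(directory, unicodedata.normalize(form, name))
        try:
            with open(candidate, "rb"):
                results[form] = True
        except OSError:
            results[form] = False
    return results

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create the file under its decomposed name, as in Dave's case.
        stored = unicodedata.normalize("NFD", "d\u00e9tente.html")
        with open(os.path.join(tmp, stored), "w") as fh:
            fh.write("test")
        print(try_both_forms(tmp, "d\u00e9tente.html"))
```

If both entries come back true, the file system (or the layer above it) is normalising names for you; if only one does, the calling software has to do the mapping itself.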