Labyrinths vs. PyObjC vs. Encodings (Quarter Life Crisis)

1153 words on Mac OS X

Rather than bitching about stupid journalists quoting stupid Austrian academics (well, at least one of the parties must have been stupid, it’s hard to tell which) on the topic of labyrinths, which in turn led my social sciences flatmate to get a completely wrong impression of what a labyrinth or maze is because he’ll simply believe anything written on paper (aka social science thinking) rather than a perfectly logical argument I present to him (aka thinking), let me bitch about something else…

Like the fact that some complete frigtard managed to produce a TeX file which contained Roman text along with Cyrillic characters. Well, that shouldn’t be a big problem. However, it is a problem if - and that’s how we reconstructed things - the document was created in the following way:

The document was first written as a TeX document in Russian in the 1990s. It was written on a Mac and saved using the Mac Cyrillic encoding.
Cyrillic lookalike characters were used for Roman ones in a number of places. For example Н (Cyrillic capital letter EN) instead of H (Latin capital letter H).
Then someone (else) created an English translation of the document and didn’t re-type all of the formulas but simply copied and pasted them from the Russian version.

This is hard to notice when looking at the text file in the right encoding and it’s rather tricky to find out what exactly happened there as well. (Just try to find out the exact encoding of a file in a Linux GUI if you don’t really know what you’re doing… and who’d guess Mac-friggin-Cyrillic for it?) But it certainly leaves us with a document containing Cyrillic characters where they shouldn’t be.
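
A minimal sketch of what went wrong, assuming Python 3 and its standard mac_cyrillic and mac_roman codecs (the sample string is made up for illustration, not taken from the actual document): the same bytes read differently depending on the encoding guessed, and the Cyrillic lookalike is a different code point from the Latin letter it imitates.

```python
import unicodedata

# Hypothetical sample: the fragment "H = 1" with a Cyrillic EN (U+041D)
# typed in place of the Latin H, stored as Mac Cyrillic bytes.
raw = "\u041d = 1".encode("mac_cyrillic")

as_cyrillic = raw.decode("mac_cyrillic")  # the encoding the file was saved in
as_roman = raw.decode("mac_roman")        # a plausible wrong guess


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


print(describe(as_cyrillic[0]))  # the lookalike
print(describe("H"))             # the letter that was meant
print(repr(as_roman))            # same bytes read with the wrong encoding
```

Printing the Unicode name is a cheap way to tell lookalikes apart, since they are indistinguishable in most fonts.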

The next problems were to find all the Cyrillic characters in the document and to replace them. That should be fairly trivial but with the tools I have at hand I found it quite complicated. [read: if you know a one line regular expression to achieve the same thing, please leave a comment!] A straightforward change of file encoding didn’t do the trick as this replacement of characters is no such thing. So after a bit of back and forth I ended up doing the following:

Open the file in a text editor as Mac Cyrillic. This made sure the wrong characters are correctly interpreted.
Copy the file’s text to the clipboard.
Hack together the following PyObjC script to replace characters in the clipboard and write them to a file - printing out the conversions done and the potentially remaining problems to the console:

#!/usr/bin/env python
#coding=utf-8
from AppKit import *

pb = NSPasteboard.generalPasteboard()
input = pb.stringForType_(NSStringPboardType).mutableCopy()

originalstring = u'ЗЃгНМЕКаСТ•А†бä'
newstring =  u'3oyHMEKaCTeAacä'
l = len(originalstring)

for i in range(0, l-1):
    a = originalstring[i]
    b = newstring[i]
    count = input.replaceOccurrencesOfString_withString_options_range_(a, b, NSLiteralSearch, NSMakeRange(0, input.length()))
    print a + "->" + b
    print count

a = input.rangeOfCharacterFromSet_( NSCharacterSet.characterSetWithRange_(NSMakeRange(0,127)).invertedSet())

print input[a.location-100:a.location+10]
print input.writeToFile_atomically_encoding_error_("/Users/ssp/Desktop/test.tex", 0, NSASCIIStringEncoding, None)

This still feels like it is a bit too complicated for the simple problem at hand. But it did the job. I think I’m starting to like PyObjC. To begin with, Python as a language seems quite nice and intuitive. I haven’t read a manual but just from seeing a few examples I could deduce enough to get a working script. I wouldn’t say that this is necessarily a technically good thing for a language, but for a scripting language used for off-the-cuff hackish stuff like this it’s brilliant. [This certainly beats the atrocity known as Perl or the hideousness of shell scripts.]
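
For comparison, the replacement step can also be sketched in plain Python 3 without Cocoa, using str.translate. The LOOKALIKES table and both helper names are hypothetical and cover only a handful of obvious pairs; they are not the mapping used in the script above.

```python
# Assumed subset of Cyrillic lookalikes, written as escapes because the
# glyphs cannot be told apart from their Latin twins on screen.
LOOKALIKES = {
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}


def replace_lookalikes(text, mapping=LOOKALIKES):
    """Return text with every mapped character swapped for its Latin twin."""
    return text.translate(str.maketrans(mapping))


def first_non_ascii(text):
    """Return the offset of the first non-ASCII character, or -1 if none."""
    for offset, ch in enumerate(text):
        if ord(ch) > 127:
            return offset
    return -1


sample = "\u041d = \u041c + \u0421"  # Cyrillic EN, EM and ES posing as H, M, C
fixed = replace_lookalikes(sample)
print(fixed)
print(first_non_ascii(fixed))
```

A translation table handles all pairs in one pass, so there is no index loop to get wrong.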

Then there’s Objective-C and the Cocoa frameworks. As I am familiar with them, having direct access is rather good and gives me the ability to do many things right away without needing to re-learn everything. Again, perfect for hacks like this. And it’s better than the ‘real thing’ because I don’t have the overhead of creating an Xcode project, compiling and so on when doing this. A nice thing about Cocoa is its fault tolerance. You can just get away with passing zero pointers in many situations, leaving you with a working script even though things didn’t go perfectly. [That certainly beats AppleScript; as AppleScript is probably the worst language ever for string manipulation I wanted to avoid it here.]

Extra kudos go to the PyObjC implementors for making a bit of an effort in their error messages. Compared to other languages, these were actually helpful. There’s nothing as bad as starting a hack like this and running into a cryptic error message which halts progress. At that point you start losing loads of time. For some things - such as the inclusion of non-ASCII characters in the script, an issue which I was really scared about considering the general Unicode incompetence in scripting languages - PyObjC simply spat out a message pointing right to a web page discussing the issue and telling me I want to write #coding=utf-8 at the top of my script. Another Google search made clear that I want to add a u in front of my string constants and things worked from there. That’s excellent.

Of course this script is still imperfect. It’s off-the-cuff by someone who doesn’t know what he’s doing. And I failed to do everything I wanted to achieve. A few questions that stuck follow. Any insight on the issues they present will be appreciated.

Is there a really simple way to get a collection of all ‘strange’ characters in a document? Just a list of all non-ASCII characters without duplicates in a line of Cocoa or a magic regular expression? That would have made it easier to gauge the size of the problem before starting.
I managed to find the ‘wrong’ characters one-by-one but I would only get their offset from the beginning of the string easily. When looking at things in a text editor at the same time you usually just have a command to jump to a certain line but not to a certain character in the file. What’s the easiest way to solve this?
It seems that the last character in my strings is ignored. (That’s why I added a trivial replacement at the very end.)
Python strings and NSStrings don’t seem to be exactly identical in PyObjC. For some, Objective-C’s -length method works; for others I had to use Python’s len() function.
I couldn’t write to the clipboard. When doing so in NSPasteboard you need to pass an ‘owner’ to the method you are calling. Usually that is self. I tried entering self but PyObjC just gave an error. I tried passing other things like 0 or a string instead and that led to a rather unfortunate situation where OS X.5’s whole clipboard infrastructure was broken. pboard needed to be killed and any running application which tried to use the clipboard froze and needed to be killed when doing so. Hence I just wrote the results to a file…
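
One possible plain-Python take on the first two questions, as a rough sketch rather than a Cocoa one-liner: collect every distinct non-ASCII character together with the line and column where it occurs, so a text editor's jump-to-line command becomes usable. It assumes Python 3 and "\n" line endings; the function name non_ascii_report and the sample text are made up for illustration.

```python
def non_ascii_report(text):
    """Map each distinct non-ASCII character to its (line, column) positions.

    Lines and columns are 1-based, the way text editors usually count them.
    Assumes lines are separated by "\n".
    """
    report = {}
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                report.setdefault(ch, []).append((line_no, col_no))
    return report


# Hypothetical sample with a Cyrillic EN (U+041D) and a Latin a-umlaut.
sample = "E = \u041d + 1\nK\u00e4se = \u041d\n"
for ch, places in sorted(non_ascii_report(sample).items()):
    print("U+%04X %s %s" % (ord(ch), ch, places))
```

The keys of the returned dictionary are the list of 'strange' characters without duplicates; the values say where to look for each one.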

April 16, 2008, 9:42

Tagged as cyrillic, encoding, Mac OS X, programming, pyobjc, tex, unicode, x.5.

Comments

Comment by Fred Blasdel:

Your code got Markdowned into brokenness

I’ve never liked the underscore-underscore thing in Python, but overall there are very few annoyances compared to Ruby (which has way too much sugar, especially in the way that it screws up and slows down the implementations).

I’ve written several PyObj-C applications, often to write simple menubar helper apps without using Interface Builder or anything, just a REPL. I like it quite a bit!

April 16, 2008, 12:46

Comment by ssp:

I’m battling Markdown right now… it always takes me by surprise.

Update: To me this looks more like a bug in Markdown than incompetence of myself. Colour me surprised.

April 16, 2008, 12:50

Comment by Michael Tsai:

Non-ASCII characters: set([c for c in input if ord(c) > 127])

Pasteboard: If you aren’t using lazy writing, you can use None/nil for the owner. self should work, though, if it’s an Objective-C object.

April 16, 2008, 17:00

Comment by Steffen:

You need to use range(0, l) as Python’s range(i, j) will have j-1 as last element (see help(range)).
In order to use NSPasteboard, just pass None as owner. Objective-C-Cocoa will happily accept nil as owner if you provide the pasteboard data immediately after declaring the types. None is the equivalent of nil in Python.
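
The range() point can be seen in a few lines. This is a small sketch in Python 3 with made-up three-character strings standing in for the two replacement strings of the script:

```python
original = "abc"
replacement = "xyz"
n = len(original)

# range(0, n - 1) stops one short, so the last pair is never visited.
short = [(original[i], replacement[i]) for i in range(0, n - 1)]

# range(0, n), or simply range(n), covers every index from 0 to n - 1.
full = [(original[i], replacement[i]) for i in range(n)]

# zip pairs the characters directly and needs no index arithmetic.
zipped = list(zip(original, replacement))

print(short)
print(full)
print(zipped == full)
```

Iterating over zip(original, replacement) sidesteps the upper bound entirely, which may be the simplest fix for a loop like this.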

April 17, 2008, 12:07

Comment by ssp:

@Steffen:
Thanks for the hints. None does the trick and knowing how range() works certainly helps.

I’d still say that passing the wrong parameter to Cocoa shouldn’t be able to stall all other applications. Filed a bug on that, let’s see how it goes.

@Michael:
Yes, None does work. But zero just screws things up. I thought self should work but it gave an error message for me. Perhaps because I didn’t define my own class and ‘self’ essentially was the whole script?

April 17, 2008, 13:10

Comment by Michael Tsai:

In Objective-C, nil and 0 are interchangeable, but in Python None and 0 are not. PyObjC will convert None to a nil object, but it will convert 0 to an NSNumber, which is not a valid pasteboard owner.

In Python, self is not implicitly defined. If you don’t create a variable called self or have a method with self as an argument, self will be undefined.

April 18, 2008, 19:46

Comment by ssp:

@Michael:
Aha, I’ll have to get used to that self thing, I suppose.

And I’d still say the whole clipboard infrastructure shouldn’t collapse even if I pass a blatantly inadequate object to it…

April 20, 2008, 2:58

Comment by Carl:

Here are my copy and paste functions for Python.

import AppKit
import Foundation

def pbcopy(s):
    "Copy string argument to clipboard"
    board = AppKit.NSPasteboard.generalPasteboard()
    board.declareTypes_owner_([AppKit.NSStringPboardType], None)
    newStr = Foundation.NSString.stringWithString_(s)
    newData = \
        newStr.nsstring().dataUsingEncoding_(Foundation.NSUTF8StringEncoding)
    board.setData_forType_(newData, AppKit.NSStringPboardType)

def pbpaste():
    "Returns contents of clipboard"
    board = AppKit.NSPasteboard.generalPasteboard()
    content = board.stringForType_(AppKit.NSStringPboardType)
    return content

I then put an object around these to make using them from the Terminal more convenient. (For example, as written, pbcopy will crash if passed a non-string.)

class PasteBoard(object):
    def copy(self, s):
        if not isinstance(s, basestring):
            s = repr(s)
        pbcopy(s)
    paste = property(lambda self: pbpaste(), fset=copy)
    copy = property(lambda self: pbpaste(), fset=copy)

    def lines():
        def fget(self): 
            return pbpaste().replace("\r","\n").split("\n")

        def fset(self, l):
            pbcopy('\n'.join(unicode(i) for i in l))

        return {'fget':fget, 'fset':fset}
    lines = property(**lines())

    def split():
        def fget(self):
            def _(sep):
                return pbpaste().replace("\r"," ").replace("\n"," ").split(sep)
            return _

        def fset(self, t):
            pbcopy(unicode(t[0]).join(unicode(i) for i in t[1]))

        return {'fget':fget, 'fset':fset}
    split = property(**split())
    join = split

    def words():
        def fget(self):
            return pbpaste().replace("\r"," ").replace("\n"," ").split(" ")

        def fset(self, l):
            pbcopy(' '.join(unicode(i) for i in l))

        return {'fget':fget, 'fset':fset}
    words = property(**words())

    def to_plain(self):
        pbcopy(pbpaste())

    def to_ascii(self):
        pbcopy(pbpaste().encode("ASCII", "ignore"))

    def to_nonascii(self):
        pbcopy(''.join(char for char in pbpaste() if ord(char)>127))

    def to_indent(self):
        pbcopy('\n'.join('\t'+line for line in pbpaste().split("\n")))

    def to_dedent(self):
        lines = pbpaste().replace("\t", "    ").split("\n")
        lines = '\n'.join(line[4:] for line in lines)
        pbcopy(lines)

    def to_title(self):
        pbcopy(pbpaste().title())

pb = PasteBoard()

This can be used in the terminal like so:

>>> pb.copy = 1234
>>> pb.paste
u'1234'
>>> pb.to_indent()
>>> pb.paste
u'\t1234'
>>> pb.lines = pb.paste
>>> print pb.paste

1
2
3
4
>>> pb.copy = "the quick brown fox jumps over the lazy dog"
>>> sum(1 for word in pb.words)
9
>>> pb.words = (word for word in pb.words if "e" in word)
>>> pb.paste
u'the over the'
>>> pb.paste = u"I ♥ 日本語!!"
>>> pb.paste
u'I \u2665 \u65e5\u672c\u8a9e!!'
>>> print ''.join(char for char in pb.paste if ord(char)<128)
I  !!

I find it convenient enough.

April 20, 2008, 7:32

Quarter Life Crisis

The world according to Sven-S. Porst