Quarter Life Crisis

The world according to Sven-S. Porst


Labyrinths vs. PyObjC vs. Encodings


Rather than bitching about stupid journalists quoting stupid Austrian academics (well, at least one of the parties must have been stupid; it’s hard to tell which) on the topic of labyrinths, which in turn led my social-sciences flatmate to get a completely wrong impression of what a labyrinth or maze is, because he’ll simply believe anything written on paper (aka social science thinking) rather than a perfectly logical argument I present to him (aka thinking), let me bitch about something else…

Like the fact that some complete frigtard managed to produce a TeX file which contained Roman text along with Cyrillic characters. Well, that shouldn’t be a big problem. However, it is a problem if - and that’s how we reconstructed things - the document was created in the following way:

This is hard to notice when looking at the text file in the right encoding and it’s rather tricky to find out what exactly happened there as well. (Just try to find out the exact encoding of a file in a Linux GUI if you don’t really know what you’re doing… and who’d guess Mac-friggin-Cyrillic for it?) But it certainly leaves us with a document containing Cyrillic characters where they shouldn’t be.
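A minimal sketch of how such stray characters could be made visible in plain Python 3 (not the approach used below): it assumes the file's raw bytes are decoded with Python's `mac_cyrillic` codec and then searches for anything in the basic Cyrillic Unicode block. The sample bytes and the name `find_cyrillic` are made up for illustration.

```python
import re

# Any character in the basic Cyrillic Unicode block.
CYRILLIC = re.compile("[\u0400-\u04FF]")


def find_cyrillic(raw_bytes, encoding="mac_cyrillic"):
    """Decode raw_bytes and return (offset, character) pairs for Cyrillic hits."""
    text = raw_bytes.decode(encoding)
    return [(m.start(), m.group()) for m in CYRILLIC.finditer(text)]


# Hypothetical sample: ASCII TeX source with two high bytes (0x91, 0xE0)
# that Mac Cyrillic decodes as the Cyrillic letters U+0421 and U+0430.
raw = b"\\section{\x91l\xe0ssification}"

for offset, char in find_cyrillic(raw):
    print(offset, "U+%04X" % ord(char))
```

The offsets are character positions in the decoded text, which is usually what a text editor's "go to character" expects.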

The next problems were to find all the Cyrillic characters in the document and to replace them. That should be fairly trivial, but with the tools I have at hand I found it quite complicated. [Read: if you know a one-line regular expression to achieve the same thing, please leave a comment!] A straightforward change of file encoding didn’t do the trick, as this replacement of characters is no such thing. So after a bit of back and forth I ended up doing the following:

1. Open the file in a text editor as Mac Cyrillic. This makes sure the wrong characters are correctly interpreted.
2. Copy the file’s text to the clipboard.
3. Hack together the following PyObjC script to replace characters in the clipboard and write them to a file, printing out the conversions done and the potentially remaining problems to the console:

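#!/usr/bin/env python
#coding=utf-8
# Python 2 with PyObjC; operates on whatever text is on the clipboard.
from AppKit import *

# Fetch the clipboard text as a mutable NSString so it can be edited in place.
pb = NSPasteboard.generalPasteboard()
input = pb.stringForType_(NSStringPboardType).mutableCopy()

# The character at position i of originalstring is replaced by the
# character at position i of newstring.
originalstring = u'ЗЃгНМЕКаСТ•А†бä'
newstring =  u'3oyHMEKaCTeAacä'
l = len(originalstring)

# range(l) visits every index; range(0, l-1) would skip the last pair.
for i in range(l):
    a = originalstring[i]
    b = newstring[i]
    count = input.replaceOccurrencesOfString_withString_options_range_(a, b, NSLiteralSearch, NSMakeRange(0, input.length()))
    print a + "->" + b
    print count

# Find the first character outside ASCII (code points 0-127, i.e. a range of length 128).
a = input.rangeOfCharacterFromSet_( NSCharacterSet.characterSetWithRange_(NSMakeRange(0,128)).invertedSet())

# Show some context around the first remaining problem, then write the result as ASCII.
print input[a.location-100:a.location+10]
print input.writeToFile_atomically_encoding_error_("/Users/ssp/Desktop/test.tex", 0, NSASCIIStringEncoding, None)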

This still feels like it is a bit too complicated for the simple problem at hand. But it did the job. I think I’m starting to like PyObjC. To begin with, Python as a language seems quite nice and intuitive. I haven’t read a manual, but just from seeing a few examples I could deduce enough to get a working script. I wouldn’t say that this is necessarily a technically good thing for a language, but for a scripting language used for off-the-cuff hackish stuff like this it’s brilliant. [This certainly beats the atrocity known as Perl or the hideousness of shell scripts.]
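For comparison, the replacement step alone could probably be sketched without the pasteboard, using `str.translate` in Python 3 (the script above is Python 2). The lookalike table is copied from that script, minus the identity pair for ä; the name `fix_lookalikes` and the sample input are assumptions for illustration only.

```python
import re

# Lookalike table copied from the clipboard script: the character at
# position i of ORIGINAL is replaced by the one at position i of REPLACEMENT.
ORIGINAL = "ЗЃгНМЕКаСТ•А†б"
REPLACEMENT = "3oyHMEKaCTeAac"
TABLE = str.maketrans(ORIGINAL, REPLACEMENT)


def fix_lookalikes(text):
    """Return the translated text plus any non-ASCII characters still left."""
    cleaned = text.translate(TABLE)
    leftovers = sorted(set(re.findall(r"[^\x00-\x7f]", cleaned)))
    return cleaned, leftovers


# Hypothetical input: four mapped Cyrillic capitals followed by one
# Cyrillic letter (U+0414) that the table does not cover.
cleaned, leftovers = fix_lookalikes("\u0421\u0422\u0410\u041d\u0414")
print(cleaned)
print(leftovers)
```

Returning the leftovers plays the same role as the context printout in the script: anything in that list still needs a decision before the file can be written as ASCII.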

Then there’s Objective-C and the Cocoa frameworks. As I am familiar with them, having direct access is rather good and gives me the ability to do many things right away without needing to re-learn everything. Again, perfect for hacks like this. And it’s better than the ‘real thing’ because I don’t have the overhead of creating an Xcode project, compiling and so on when doing this. A nice thing about Cocoa is its fault tolerance. You can just get away with passing zero pointers in many situations, leaving you with a working script even though things didn’t go perfectly. [That certainly beats AppleScript; as AppleScript is probably the worst language ever for string manipulation, I wanted to avoid it here.]

Extra kudos go to the PyObjC implementors for making a bit of an effort in their error messages. Compared to other languages, these were actually helpful. There’s nothing as bad as starting a hack like this and running into a cryptic error message which halts progress. At that point you start losing loads of time. For some things - such as the inclusion of non-ASCII characters in the script, an issue which I was really scared about considering the general Unicode incompetence in scripting languages - PyObjC simply spat out a message pointing right to a web page discussing the issue and telling me I want to write #coding=utf-8 at the top of my script. Another Google search made clear that I want to add a u in front of my string constants, and things worked from there. That’s excellent.

Of course this script is still imperfect. It’s off-the-cuff by someone who doesn’t know what he’s doing. And I failed to do everything I wanted to achieve. A few questions that stuck follow. Any insight on the issues they present will be appreciated.

April 16, 2008, 9:42

Tagged as cyrillic, encoding, Mac OS X, programming, pyobjc, tex, unicode, x.5.

Comments

Comment by Fred Blasdel:

Your code got Markdowned into brokenness

I’ve never liked the underscore-underscore thing in Python, but overall there are very few annoyances compared to Ruby (which has way too much sugar, especially in the way that it screws up and slows down the implementations).

I’ve written several PyObjC applications, often to write simple menubar helper apps without using Interface Builder or anything, just a REPL. I like it quite a bit!

April 16, 2008, 12:46

Comment by ssp:

I’m battling Markdown right now… it always takes me by surprise.

Update: To me this looks more like a bug in Markdown than incompetence of myself. Colour me surprised.

April 16, 2008, 12:50

Comment by Michael Tsai:

Non-ASCII characters: set([c for c in input if ord(c) > 127])

Pasteboard: If you aren’t using lazy writing, you can use None/nil for the owner. self should work, though, if it’s an Objective-C object.

April 16, 2008, 17:00

Comment by Steffen:

April 17, 2008, 12:07

Comment by ssp:

@Steffen:
Thanks for the hints. None does the trick and knowing how range() works certainly helps.

I’d still say that passing the wrong parameter to Cocoa shouldn’t be able to stall all other applications. Filed a bug on that, let’s see how it goes.

@Michael:
Yes, None does work. But zero just screws things up. I thought self should work but it gave an error message for me. Perhaps because I didn’t define my own class and ‘self’ essentially was the whole script?

April 17, 2008, 13:10

Comment by Michael Tsai:

In Objective-C, nil and 0 are interchangeable, but in Python None and 0 are not. PyObjC will convert None to a nil object, but it will convert 0 to an NSNumber, which is not a valid pasteboard owner.

In Python, self is not implicitly defined. If you don’t create a variable called self or have a method with self as an argument, self will be undefined.

April 18, 2008, 19:46

Comment by ssp:

@Michael:
Aha, I’ll have to get used to that self thing, I suppose.

And I’d still say the whole clipboard infrastructure shouldn’t collapse even if I pass a blatantly inadequate object to it…

April 20, 2008, 2:58

Comment by Carl:

Here are my copy and paste functions for Python.

import AppKit
import Foundation

def pbcopy(s):
    "Copy string argument to clipboard"
    board = AppKit.NSPasteboard.generalPasteboard()
    board.declareTypes_owner_([AppKit.NSStringPboardType], None)
    newStr = Foundation.NSString.stringWithString_(s)
    newData = \
        newStr.nsstring().dataUsingEncoding_(Foundation.NSUTF8StringEncoding)
    board.setData_forType_(newData, AppKit.NSStringPboardType)

def pbpaste():
    "Returns contents of clipboard"
    board = AppKit.NSPasteboard.generalPasteboard()
    content = board.stringForType_(AppKit.NSStringPboardType)
    return content

I then put an object around these to make using them from the Terminal more convenient. (For example, as written, pbcopy will crash if passed a non-string.)

class PasteBoard(object):
    def copy(self, s):
        if not isinstance(s, basestring):
            s = repr(s)
        pbcopy(s)
    paste = property(lambda self: pbpaste(), fset=copy)
    copy = property(lambda self: pbpaste(), fset=copy)

    def lines():
        def fget(self): 
            return pbpaste().replace("\r","\n").split("\n")

        def fset(self, l):
            pbcopy('\n'.join(unicode(i) for i in l))

        return {'fget':fget, 'fset':fset}
    lines = property(**lines())

    def split():
        def fget(self):
            def _(sep):
                return pbpaste().replace("\r"," ").replace("\n"," ").split(sep)
            return _

        def fset(self, t):
            pbcopy(unicode(t[0]).join(unicode(i) for i in t[1]))

        return {'fget':fget, 'fset':fset}
    split = property(**split())
    join = split

    def words():
        def fget(self):
            return pbpaste().replace("\r"," ").replace("\n"," ").split(" ")

        def fset(self, l):
            pbcopy(' '.join(unicode(i) for i in l))

        return {'fget':fget, 'fset':fset}
    words = property(**words())

    def to_plain(self):
        pbcopy(pbpaste())

    def to_ascii(self):
        pbcopy(pbpaste().encode("ASCII", "ignore"))

    def to_nonascii(self):
        # ASCII ends at code point 127, so everything above it is non-ASCII.
        pbcopy(''.join(char for char in pbpaste() if ord(char)>127))

    def to_indent(self):
        pbcopy('\n'.join('\t'+line for line in pbpaste().split("\n")))

    def to_dedent(self):
        lines = pbpaste().replace("\t", "    ").split("\n")
        lines = '\n'.join(line[4:] for line in lines)
        pbcopy(lines)

    def to_title(self):
        pbcopy(pbpaste().title())

pb = PasteBoard()

This can be used in the terminal like so:

>>> pb.copy = 1234
>>> pb.paste
u'1234'
>>> pb.to_indent()
>>> pb.paste
u'\t1234'
>>> pb.lines = pb.paste
>>> print pb.paste

1
2
3
4
>>> pb.copy = "the quick brown fox jumps over the lazy dog"
>>> sum(1 for word in pb.words)
9
>>> pb.words = (word for word in pb.words if "e" in word)
>>> pb.paste
u'the over the'
>>> pb.paste = u"I ♥ 日本語!!"
>>> pb.paste
u'I \u2665 \u65e5\u672c\u8a9e!!'
>>> print ''.join(char for char in pb.paste if ord(char)<128)
I  !!

I find it convenient enough.

April 20, 2008, 7:32
