Quarter Life Crisis

The world according to Sven-S. Porst


Identifying text files

951 words on

The combination of a couple of different observations led to this post. One of them is that when I’m writing drafts of posts, it has proven advantageous to name the files ‘postname.html’ rather than just ‘postname’, which’d be more natural. One advantage is that text editors tend to activate their HTML highlighting based on how a file’s name ends. Another is that you’ll be able to find the file using Spotlight, which in turn means that the file won’t be indexed unless its name ends in ‘.html’.

The same applies to the files I download from this site for local reference. As you may be able to see, the web pages for these posts have simple file names that aren’t uglified by having their name end in ‘.html’ or other meaningless heaps of characters. And this is the reason why the files weren’t indexed after being downloaded – thus defeating the main purpose of downloading them in the first place (which is why I now have an embarrassing Automator script – which I could’ve easily done in good old AppleScript as well, of course – to add ‘.html’ at the end of the file names).
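For illustration, here is a minimal Python sketch of what such a renaming step could look like. It is not the Automator workflow mentioned above; the function name `add_html_extension` and the rule that "no dot in the name means no extension" are assumptions made for this example.

```python
from pathlib import Path


def add_html_extension(folder):
    """Append '.html' to every regular file in `folder` that has no extension.

    Hidden files are skipped, and so are files whose new name would
    collide with an existing one. Returns the list of new paths.
    """
    renamed = []
    # sorted() materialises the listing, so renaming while looping is safe.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue
        if path.suffix:  # name already ends in '.something'
            continue
        target = path.with_name(path.name + ".html")
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

Called on a folder of downloaded pages, this would turn ‘postname’ into ‘postname.html’ and leave ‘notes.txt’ alone.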

The third time I was pointed at the problem was when reading a recent comment by Mike A on the Spotlight hacking I described. Part of the (otherwise somewhat mysterious) comment was that Mike noted that SubEthaEdit’s files won’t be indexed by Spotlight unless their name ends in some kind of ‘.html’, ‘.txt’ junk. Which means this is the same problem once more.

So I finally decided to take a closer look at it and my short conclusion is that OS X.4 has severe difficulties with identifying text files.

More technically, recall that there currently are two ways for the System to learn about a file’s type: its file Type as stored on the (HFS) volume, and the end of its name. Text files are the most basic of files and they are commonplace. Even if you wanted to name all your text files with names ending in ‘.txt’ or something, this often isn’t an option, as files might need to have particular names when they are to be used by other programs. Or you might just have downloaded them and didn’t pick their name yourself (with OS X apparently not being clever enough to store the file’s MIME type in the file system for downloaded files, thus losing information – although there seem to be embarrassingly stupid attempts to simulate those effects by Safari, which occasionally seems to change the file name by adding some extension to it without warning the user). With these examples alone you are bound to end up with a Spotlight index which is much less complete than you’d wish.
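The two sources of type information can be pictured with a toy lookup in Python. This is only a sketch: the tables are tiny, the function name `guess_uti` is made up, and the order in which name ending and Type are consulted is an assumption rather than a description of what the System really does.

```python
import os

# Tiny illustrative tables; the real System knows far more types.
UTI_BY_EXTENSION = {
    ".txt": "public.plain-text",
    ".html": "public.html",
}
UTI_BY_TYPE_CODE = {
    "TEXT": "com.apple.traditional-mac-plain-text",
}


def guess_uti(filename, type_code=None):
    """Return a UTI for a file, using only its name ending and its Type.

    The file's content is never looked at, which is exactly the
    limitation discussed in this post.
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension in UTI_BY_EXTENSION:
        return UTI_BY_EXTENSION[extension]
    if type_code in UTI_BY_TYPE_CODE:
        return UTI_BY_TYPE_CODE[type_code]
    return "public.data"  # nothing known: the most general type
```

With neither a telling name nor a Type, as in `guess_uti("notes")`, the sketch falls through to `public.data`, however obviously textual the content may be.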

A bigger effort has to be made to recognise file types. At least for text. Tools like file seem to be quite good at it and make low-level Unix stuff look much better than the latest shiny Mac fashion:

[nssp:~] ssp% echo "Hello World!" > testfile
[nssp:~] ssp% mdimport -d1 testfile
2005-07-10 20:03:14.773 mdimport[16865] Import '/Users/ssp/testfile' type 'public.data' no mdimporter
[nssp:~] ssp% file testfile
testfile: ASCII text

So, yep, there’s a file which is obviously a text file and Spotlight won’t index it because it is considered to be of type public.data, which is about as general as it could be and won’t have an indexer. I guess I could just change the text importer to also cover the public.data UTI, but I’m pretty sure that’d result in plenty of unwanted side effects: the importer would have to look at any file that Spotlight considers worthy of being indexed and for which there is no other importer, thus potentially causing a lot of unwanted load on the system.
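To give an idea of what looking at the content could mean, here is a deliberately crude Python sketch. It is far simpler than what file actually does, and the function names and the 4096-byte sample size are assumptions made for this example: a file counts as text if the start of it contains no control bytes other than the usual whitespace.

```python
# Control bytes that routinely appear in plain text: tab, LF, FF, CR.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}


def looks_like_text(data, sample_size=4096):
    """Crude guess whether `data` (bytes) is text, based on content only."""
    sample = data[:sample_size]
    if not sample:
        return False  # an empty file gives no evidence either way
    for byte in sample:
        # NUL and other control bytes hint at binary data. Bytes >= 0x80
        # are allowed so that UTF-8 or Mac Roman text still passes.
        if (byte < 0x20 and byte not in ALLOWED_CONTROLS) or byte == 0x7F:
            return False
    return True


def file_looks_like_text(path, sample_size=4096):
    """Apply the same check to the first bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size), sample_size)
```

Only a small sample is read per file, which keeps the cost low, but a check like this would still have to run on every file that has no other importer.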

But we still have the Mac’s other and traditionally preferred method of storing file type data: the file’s Type. For text files that’s the ‘TEXT’ file type. What’s odd about X.4, though, is that files of that Type will be assigned the really strange com.apple.traditional-mac-plain-text UTI. So what does that mean? At first, the name sounds like, well, it might designate a file in Mac Roman encoding with carriage returns marking new paragraphs. But then you scratch your head and remember all the international versions of the MacOS where SimpleText would happily write files containing text of non-Mac Roman, say some Asian, encodings into text files. And you’ll remember all the files you got on floppy disks from your MS-DOS friends which PC Exchange happily gave the ‘TEXT’ file type.

So that UTI can’t really signify any precise information about a text file beyond it being text. Which makes me wonder in which respect that UTI is different from public.text. More significantly, will it make sense to use the ‘TEXT’ Type in modern applications despite it being marked as ‘traditional’? Just by its name it looks like you shouldn’t. But on the other hand, marking a file with the ‘TEXT’ type is the only way I could see to get it into the Spotlight index without breaking its name.
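As an aside on what a Type actually is: a four-character code such as ‘TEXT’ is stored as a 32-bit integer, one byte per character. The following Python sketch converts between the two forms. The helper names are made up for this example, and using the Mac Roman encoding for the characters is an assumption that makes no difference for plain ASCII codes.

```python
def fourcc_to_int(code):
    """Convert a four-character Type such as 'TEXT' to its integer form."""
    raw = code.encode("mac_roman")
    if len(raw) != 4:
        raise ValueError("a Type must be exactly four characters long")
    return int.from_bytes(raw, "big")


def int_to_fourcc(value):
    """Convert the integer form of a Type back to its four characters."""
    return value.to_bytes(4, "big").decode("mac_roman")


print(fourcc_to_int("TEXT"))      # 1413830740
print(int_to_fourcc(1413830740))  # TEXT
```

The number 1413830740 is the value that mdls reports as kMDItemFSTypeCode for a file of Type ‘TEXT’.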


P.S. SubEthaEdit seems to give the files it creates a ‘****’ Type. I’m not exactly sure what the advantage of that is, but I guess there was some reason for it. Opening the application’s ‘Info.plist’ file and replacing ‘****’ by ‘TEXT’ will give newly created files the Type ‘TEXT’ and thus make Spotlight index them. However, this will give the files a TextEdit icon by default for some reason. It’s a bit odd. Perhaps SubEthaEdit could handle this a bit better. But I think it’s mainly a problem of the System itself.

July 11, 2005, 1:37

Tagged as Mac OS X.

Comments

Comment by Nicholas Riley:

I don’t see the big deal really with ‘TEXT’ being com.apple.traditional-mac-plain-text as long as it also counts as being public.text and public.plain-text, as it does:

[p5:1036] ~%SetFile -t TEXT foo
[p5:1037] ~%mdls foo
foo -------------
kMDItemAttributeChangeDate     = 2005-07-10 21:20:32 -0500
kMDItemContentCreationDate     = 2005-07-10 21:20:21 -0500
kMDItemContentModificationDate = 2005-07-10 21:20:21 -0500
kMDItemContentType             = "com.apple.traditional-mac-plain-text"
kMDItemContentTypeTree         = (
    "com.apple.traditional-mac-plain-text", 
    "public.plain-text", 
    "public.text", 
    "public.data", 
    "public.item", 
    "public.content"
)
kMDItemDisplayName             = "foo"
kMDItemFSContentChangeDate     = 2005-07-10 21:20:21 -0500
kMDItemFSCreationDate          = 2005-07-10 21:20:21 -0500
kMDItemFSCreatorCode           = 0
kMDItemFSFinderFlags           = 0
kMDItemFSInvisible             = 0
kMDItemFSLabel                 = 0
kMDItemFSName                  = "foo"
kMDItemFSNodeCount             = 0
kMDItemFSOwnerGroupID          = 20
kMDItemFSOwnerUserID           = 501
kMDItemFSSize                  = 4
kMDItemFSTypeCode              = 1413830740
kMDItemID                      = 6840590
kMDItemKind                    = "Text document"
kMDItemLastUsedDate            = 2005-07-10 21:20:21 -0500
kMDItemUsedDates               = (2005-07-10 21:20:21 -0500)

The downside to indexing every file that ‘looks’ like a text file, for average users, is that a lot of Unix configuration files that are otherwise hidden spring to the surface.

I think the current compromise is fine for a first release; there are other issues that are much more important, most of which you’ve written about (horrible menubar/window search UI, gigantic steps backwards in the Finder’s searching, poor asynchronous searching, not being able to chain importers, …)

July 11, 2005, 4:24

Comment by ssp:

I agree that using the ‘TEXT’ way of indicating the file’s type will work. I’m just wondering why it maps to a UTI with such a strange name.

In addition, not all files will have Types set, so when downloading stuff from the internet, say, you’ll still end up with files which should be indexed but won’t be, as the System can’t figure out that they’re text files.

As for the Unix configuration files… as Spotlight’s indexing is severely limited in its default setup, I’m not sure that this will cause too many problems. But if Spotlight is going to offer decent (i.e. complete) find capabilities to the System in the future, this might need some consideration.

July 11, 2005, 14:45

Comment by bruce:

I think I am not too late: as of August 2006, with OS 10.4.7 installed, things are still as you described. Judging from the results, all this means that TextEdit gets the lion’s share. Is it intentional on Apple’s side? I am inclined to think it is. Looking at the “positive” reasons for taking such a road, one must consider that the many Mac users who stick with OS X are not forced to go mad over “old” “Classic” text files: double-click them, and TextEdit will take care of them (what I find “drudgerous”, instead, is the renaming procedure needed when re-saving these files). This consideration does not mean I am satisfied. There are applications I like, such as Nisus Writer, which become “unnatural” to use. We all know, however, that the Classic environment is at the end of its life. It is natural to suppose that this is a main consideration in Apple’s “mind”.

August 24, 2006, 20:00
