Pushing Spotlight (Quarter Life Crisis)

Pushing Spotlight

2551 words on Mac OS X

After I wrote about Spotlight recently, Dave remarked that, strangely, Spotlight doesn’t index some Illustrator files although they are just normal PDF files and can even be displayed with Preview.

My gut feeling about this was immediately that if those files really are as standard PDFs as they seem to be, Apple’s PDF importer should be able to import them just fine without any need to have – and, more painfully, perhaps, wait for – a separate Illustrator Spotlight importer. And at least on the sample file from the latest and greatest Illustrator that I was provided with, this was true… just rename the file so that its name ends in ‘.pdf’, swear at the Finder while it throws annoyingly stupid questions at you, and the file will be imported just fine.

So the problem here wasn’t the importer’s code but just its inability to recognise the Illustrator file as one it can import. It turns out that the change the importer needs is just a minor one. In what follows I’ll outline how you can find out what the System thinks about your files and how to improve things a bit for the case of Illustrator files. None of this is terribly original and it’s mostly documented, so don’t expect any magic. If you have read some of X.4’s technical documentation, what follows will be quite boring; if you haven’t, it may save you some time and make your computer more useful.

I’ll start with an overview of the new UTIs and what they are for, as they are central to all of this. You don’t really need to read that section to fix the problem, but doing so will help you to understand what you’re doing and why it does the trick in this case.

UTIs

Central to the importing of files are Uniform Type Identifiers, UTIs for short. They are a great new concept for identifying data types to software. While the concept is quite general, we only need it for files here. You’ll probably have seen files whose names are defaced by ending in ‘.somecrypticstuff’. Unfortunately software makers, including those at Apple since OS X, thought that adding some cryptic stuff at the end of a file name is a good way to tell the computer what type of file it is dealing with. That’s of course wrong, as it is probably the worst way to do exactly that. For a start, it confuses and restricts you, the user, the computer’s boss, in how you can name your files.

It’s also highly ambiguous and imprecise, even when it’s used without making mistakes. When seeing a file’s name ending in ‘.dat’ it’ll be interesting to guess what kind of file it is. When seeing a file’s name ending in ‘.doc’ or ‘.pdf’ you have no idea which version of Word or Acrobat will be able to open the file. When seeing a file whose name ends in ‘.txt’, there remain many open questions, about the text’s encoding, say, which is essential to display the text properly. Or think about the many different things that a file whose name ends in ‘.html’ can contain… different encodings, different versions of HTML, with or without extras like PHP.

Traditionally, we were better off on the Mac – and still are if we are using applications written by considerate programmers like myself – as files have their Type stored in the file system, meaning that a file can be named arbitrarily but the computer will know what kind of file it is anyway. While being much nicer for the user, this system still had potential problems like developers not registering their codes with Apple and thus the same code being assigned to different file formats. I consider that to be more of a theoretical problem, though, and never ran into it. More practically relevant is that the information provided by the Type usually isn’t precise enough. While we may be able to handle different versions of Word’s file format that way, we still have the problem of text encodings or the various flavours of HTML.

There’s also the third concept of MIME-Types, which are used a lot on the web. When receiving a file, the web browser needs to know its type – is it a web page, or an image, or an MP3, or a PDF? While most web pages have names with file name extensions, those are mostly a convenience for the people managing those pages – as all information other than file name extensions tends to get lost when transferred via FTP, say. They’re not strictly necessary for the web to work. Peeking at your browser’s address bar right now may help to convince you of that. In addition, MIME-Types might be considered to be slightly more informative than the Mac’s Types as they are usually of the form ‘text/html’ or ‘image/jpeg’, i.e. with two levels, giving you a rough and a more precise description of the file’s type.

So there are three distinct systems around now. And the computer already has some ways to translate between them. In fact, this is something which already worked in System 7.5 with PC Exchange: copying a file without a file name extension to an MS-DOS disk would automatically add the relevant extension to the file name (in case PC Exchange knew about it). Similarly, when reading an MS-DOS disk, the system would match the file name extensions to the Mac’s own Types, display the correct icons and open the file with the correct application. The same scheme was introduced for internet downloads by the great Internet Config, which later became a part of the System’s Internet preference pane – and still may be around hidden in OS X for use by legacy applications like Internet Explorer.

The situation we are in now looks like Apple wants to abandon their classical file Types which require a Mac hard drive to work and which aren’t perfect. In addition, everybody else works with file name extensions, which seriously suck, or MIME-Types which are a bit, but not much better. The great new idea is to introduce a new system, UTIs, which solves the problems with ambiguities we see in the other concepts and that will work together with the existing systems. All while remaining relatively simple.

To avoid the ambiguity of names, everybody can just make up their own names. All right, that wouldn’t work… so you’re asked to use a ‘reverse-domain-name’ notation. E.g. Rechnungs Checker’s saved files have a net.earthlingsoft.rechnungschecker.rechnung UTI. So it’s highly unlikely that you’ll end up with a UTI which isn’t unique. The very same system has already been used quite successfully to identify applications on OS X and also to identify things in Java. It also makes sure that every developer can get all the UTIs he needs without having to go through a potentially lengthy process of checking whether they’re still available and then claiming them. I.e. it’s easy.

To avoid having information that is too coarse, every developer can specify a list of other UTIs that his UTI conforms to. To give you a good start with that, the system provides a number of existing UTIs like public.xml for XML files. As Rechnungs Checker’s files are just XML files, I specified my UTI to conform to that UTI. As the public.xml UTI itself is specified to conform to public.text you can see a nice hierarchy building here, making it possible, say, for a text editor to say it is able to open my files as well because they are text files. As any UTI can conform to several other UTIs, this can be fairly complex and wide ranging.
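The conformance idea can be sketched in a few lines of Python. This is only an illustration, not how the System does it: the table below is written by hand and contains just the two relationships mentioned above, and the function name conforms_to is made up for this example.

```python
# Hand-written, hypothetical conformance table. The real declarations live in
# Info.plist files and in the System's own list of UTIs, not in a dict.
CONFORMS_TO = {
    "net.earthlingsoft.rechnungschecker.rechnung": ["public.xml"],
    "public.xml": ["public.text"],
}


def conforms_to(uti, target, table=CONFORMS_TO):
    """Return True if `uti` is `target` or reaches it by following conformance links."""
    seen = set()
    stack = [uti]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue  # a UTI may be reachable via several parents
        seen.add(current)
        stack.extend(table.get(current, []))
    return False


# A text editor that claims public.text could open Rechnungs Checker files:
print(conforms_to("net.earthlingsoft.rechnungschecker.rechnung", "public.text"))
# The relationship only goes one way:
print(conforms_to("public.text", "public.xml"))
```

The stack-based walk is there because a UTI can conform to several others, so the hierarchy is a graph rather than a simple chain.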

As UTIs are still very new, I’m not sure everything with them is settled yet. There seem to be some UTIs missing (none for property lists, it seems, just XML), and as a file can have just a single UTI it looks like it may turn out to be quite laborious to set up UTIs that reflect a file’s HTML version, text encoding and, say, PHP content. But right now, that probably isn’t a big deal as the whole UTI system seems to be introduced quietly anyway and will have time to mature and grow.

A neat thing Apple does to force people to start using UTIs is that making a Spotlight plugin requires their use. Every Spotlight plugin contains a list of UTIs it can import. But as files don’t actually carry a UTI right now, you’ll also have to tell the System which existing Mac Types, MIME-Types or file name extensions match your UTI. And once those translations are set up (in the Info.plist files of applications or Spotlight importers) things will work.

And for things to start smoothly, Apple already provides just short of 200 UTIs in the CoreTypes.bundle (which you could easily find if Spotlight were worth its money) to identify its own files and get things right for some third party ones.

Sorry for the lengthy introduction, but we’ll see a few UTIs in what follows so it’s better to have an idea of what they’re about. For more information read Apple’s developer stuff or ars technica.

Warning

If you actually want to do the steps I describe below, you will need to be a user with administrator rights on the computer you are using as this involves manipulating a file in the System folder. This also means that you’ll first want to make a backup of anything you change and be prepared to take the blame for anything that goes wrong. I won’t take any of it. This is a ‘use at your own risk’ thing.

Furthermore, this won’t be pretty. You’ll have to use the Terminal. If you are not comfortable with that, stop now, rather than complaining later. I’m sure it wouldn’t be too hard to make some Automator actions to make it more graphical… but I don’t see how that’d be worth the effort. So whenever I talk about running commands, that has to be done in the Terminal.

Analysing your Files

Your good friend for analysing files with respect to their properties for Spotlight is the mdimport command. It’s the very same command that the System runs on every file that changes on your computer to update the Spotlight index. You can just run the mdimport command and pass it a file name to re-import any file – which in the theory of Spotlight shouldn’t be necessary, but may very well be needed in some situations. That won’t help you much, though, as the command gives you exactly no feedback if everything goes smoothly.

However, there is a handy developer option for the mdimport command. Running mdimport -d1 instead of just mdimport will tell you which file is imported, the file’s UTI and the Spotlight importer used to index the file. E.g.

[kalle:~] ssp% mdimport -d1 /Users/ssp/Documents/WG/2005-06 
2005-06-16 01:14:52.741 mdimport[5735] 
    Import '/Users/ssp/Documents/WG/2005-06' 
    type 'net.earthlingsoft.rechnungschecker.rechnung' 
    using 'file://Applications/Rechnungs%20Checker.app/Contents/
        Library/Spotlight/Rechnungs%20Checker.mdimporter/'

This is generally helpful as you can use it to figure out what type Spotlight considers the file to be and which Spotlight importer it is associated with. Strange files may end up being of an ‘unknown’ type which usually looks like dyn.le77ersandnumbers. You’ll also get the note that no appropriate importer could be found in that case.

You can also get the note that no importer could be found when a proper UTI exists, which is exactly what happens for Illustrator files (whose name ends in ‘.ai’). mdimport’s output will include no mdimporter, meaning that no Spotlight importer for its UTI com.adobe.illustrator.ai-image could be found.
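If you want to check more than a handful of files, reading that output by eye gets tedious. Here is a rough Python sketch that picks the three interesting bits out of the text. It assumes the output always looks like the single example shown above (other system versions may well print something different), and the function name parse_mdimport_debug is my own invention.

```python
import re

# The example output from above, including the line wrap in the importer URL.
SAMPLE = """2005-06-16 01:14:52.741 mdimport[5735]
    Import '/Users/ssp/Documents/WG/2005-06'
    type 'net.earthlingsoft.rechnungschecker.rechnung'
    using 'file://Applications/Rechnungs%20Checker.app/Contents/
        Library/Spotlight/Rechnungs%20Checker.mdimporter/'
"""


def parse_mdimport_debug(text):
    """Extract path, UTI and importer from `mdimport -d1` style output.

    Assumption: each value follows its label in single quotes, as in SAMPLE.
    A field that is not present (e.g. no importer found) is returned as None.
    """
    fields = {}
    for key, label in (("path", "Import"), ("uti", "type"), ("importer", "using")):
        match = re.search(r"^\s*" + label + r"\s+'([^']*)'", text, re.MULTILINE)
        if match is None:
            fields[key] = None
        else:
            # Undo line wraps inside a quoted value.
            fields[key] = re.sub(r"\s*\n\s*", "", match.group(1))
    return fields


info = parse_mdimport_debug(SAMPLE)
print(info)
# 'dyn.' prefixed types are the 'unknown' ones mentioned above.
print("unknown type" if info["uti"].startswith("dyn.") else "known type")
```

A file for which importer comes back as None while uti does not start with dyn. is in the same situation as the Illustrator files: the type is known but nobody claims it.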

Fixing the Problem

Now that we know the problem, we can try to ‘fix’ it. Let’s assume that those Illustrator files are actually PDF files, which somehow I fear they may not always be (and that may be why the people at Apple chose not to import them with that importer by default). Then all we need to do is to tell the PDF Spotlight plugin, which lives in the Spotlight folder of the System’s Library folder, that it can not only import PDF files, i.e. files of the UTI com.adobe.pdf, but also Illustrator files, i.e. files of the UTI com.adobe.illustrator.ai-image.

To do that, I highly recommend using SubEthaEdit as it has a nice graphical UI for all the freaky stuff we need to do. Using SubEthaEdit’s open dialogue and making sure the ‘Show Bundles as Folders’ checkbox at the bottom is checked, navigate to System → Library → Spotlight → PDF.mdimporter → Contents. Then open the Info.plist file there. Near the beginning of the file you’ll see

<key>CFBundleDocumentTypes</key>
<array>
    <dict>
        <key>CFBundleTypeRole</key>
        <string>MDImporter</string>
        <key>LSItemContentTypes</key>
        <array>
            <string>com.adobe.pdf</string>
        </array>
    </dict>
</array>

That’s just the part of the file which tells the system which UTIs the importer can handle. All we need to do is insert the UTI for Illustrator files, so you’ll want to replace this with

<key>CFBundleDocumentTypes</key>
<array>
    <dict>
        <key>CFBundleTypeRole</key>
        <string>MDImporter</string>
        <key>LSItemContentTypes</key>
        <array>
            <string>com.adobe.illustrator.ai-image</string>
            <string>com.adobe.pdf</string>
        </array>
    </dict>
</array>

And then you’ll want to remember the bit about making a backup copy of the original and needing administrator privileges, as SubEthaEdit will need you to have those – and to enter your password – to save the file again. After having done that, any Illustrator files you save from now on should be indexed by the PDF importer.
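If you prefer a script to clicking around, the same change could be sketched in Python with the standard plistlib module. Take this as an illustration only: the function name add_importer_uti is made up, the new UTI is appended after com.adobe.pdf rather than put in front of it (I am assuming the order doesn’t matter), and plistlib rewrites the whole file, so its formatting will change. The demonstration at the end runs on a throwaway file, not on the real plugin.

```python
import plistlib
import shutil
import tempfile
from pathlib import Path


def add_importer_uti(plist_path, new_uti):
    """Add `new_uti` to every MDImporter entry of an Info.plist.

    Returns True if the file was changed. The untouched original is kept
    next to it as 'Info.plist.orig' (a name chosen for this sketch).
    """
    plist_path = Path(plist_path)
    with plist_path.open("rb") as handle:
        info = plistlib.load(handle)

    changed = False
    for doc_type in info.get("CFBundleDocumentTypes", []):
        if doc_type.get("CFBundleTypeRole") != "MDImporter":
            continue
        content_types = doc_type.setdefault("LSItemContentTypes", [])
        if new_uti not in content_types:
            content_types.append(new_uti)
            changed = True

    if changed:
        backup = plist_path.with_name(plist_path.name + ".orig")
        if not backup.exists():  # never overwrite an earlier backup
            shutil.copy2(plist_path, backup)
        with plist_path.open("wb") as handle:
            plistlib.dump(info, handle)
    return changed


# Demonstration on a temporary file with the same structure as shown above.
demo_dir = Path(tempfile.mkdtemp())
demo_plist = demo_dir / "Info.plist"
with demo_plist.open("wb") as handle:
    plistlib.dump(
        {"CFBundleDocumentTypes": [
            {"CFBundleTypeRole": "MDImporter",
             "LSItemContentTypes": ["com.adobe.pdf"]}
        ]},
        handle,
    )
print(add_importer_uti(demo_plist, "com.adobe.illustrator.ai-image"))
```

To use it on the real file you would have to run it with administrator rights and pass the path to the Info.plist inside PDF.mdimporter. The warning from above applies in full.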

Cleaning Up

As you’ll probably have non-indexed Illustrator files on your drive which should be in the index as well, let’s just add those. The simple way to do that is to just tell the PDF importer to re-scan all the documents it can handle. This is done using the

mdimport -r /System/Library/Spotlight/PDF.mdimporter

command. As the plugin now handles Illustrator as well as PDF files, that’ll mean that all the PDF files will be re-indexed as well. So be warned if you have many of those – it may keep the computer busy for quite a while. (There are other alternatives for importing the Illustrator files, like finding all of them and then dragging them to a Terminal window or using Automator for this… but this is the easiest one.)
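The alternative of importing only the Illustrator files could look roughly like this in Python. It is a sketch under a few assumptions: Illustrator files are recognised purely by a name ending in ‘.ai’, mdimport is called once per file with just the file name as described earlier, and the function names are made up.

```python
import os
import subprocess


def find_illustrator_files(root):
    """Yield the paths of all files below `root` whose name ends in '.ai'."""
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            if name.lower().endswith(".ai"):
                yield os.path.join(folder, name)


def reimport(paths, run=subprocess.run):
    """Hand each file to mdimport, leaving all other files alone.

    `run` can be swapped out for testing. By default the real mdimport
    command is started, which only exists on Mac OS X.4 and later.
    """
    for path in paths:
        run(["mdimport", path], check=True)
```

Calling reimport(find_illustrator_files(os.path.expanduser("~"))) would then cover your home folder without touching a single PDF.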

Final Note

While this was an easy thing to do, keep in mind that not every Illustrator file may be PDF compatible in a way that importing works. So the data you get may not be terribly reliable and files may not be found after all. It may be worth checking that on a number of files by running mdimport -d3 on some files in the Terminal, digging through the cryptic output and seeing whether you find all the files’ text in there.
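A cheap first check is to look whether a file at least claims to be a PDF. The sketch below searches the first kilobyte for the ‘%PDF-’ marker that PDF files start with. The size of that window and the function name looks_like_pdf are my own choices, and passing the test says nothing about whether the text inside can actually be extracted.

```python
def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the '%PDF-' marker appears near the start of the file."""
    with open(path, "rb") as handle:
        return b"%PDF-" in handle.read(probe_bytes)
```

Files that fail this test are certainly not going to be handled by the PDF importer, so they are the ones to look at first when something can’t be found.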

Also note that upcoming system updates may replace the PDF Spotlight plugin and with it delete your changes. If you want your Illustrator files to stay in the index, check right after each update that importing them still works.

Ultimately, Adobe will have to provide an importer of their own that knows about the intricacies of the Illustrator file format and that can be expected to work reliably. Once that arrives, you’ll have to undo the changes you made to the existing PDF importer by removing the line that references Illustrator, to avoid any ambiguities about which importer is importing which files.

June 18, 2005, 23:11

Tagged as Mac OS X.

Comments

Comment by d.w.:

Excellent tutorial, Sven. Your caveat about Illustrator files is an appropriate one: sometime back in the early 90’s the Windows guys at a company I worked for were using Illustrator, and I’m pretty sure it was quite easy for them to produce “.ai” files that were very far from being conforming PDFs. Anyone with older Illustrator files hanging around (and Windows ones would be quite likely to have that extension) should be careful.

June 18, 2005, 23:59

Comment by ssp:

Thanks Dave!

That may be the reason for not including all Illustrator files (I used FreeHand back then, so I don’t know about what Illustrator did).

I guess this is another hint at how daft and limited file name ‘extensions’ are for specifying the type of a file.

I wonder whether it would be feasible to use the file tool in ambiguous situations. That would give correct results but possibly be quite slow once you need it for many files.

[kalle:~] ssp% file /Users/ssp/Desktop/SpotTested.ai 
/Users/ssp/Desktop/SpotTested.ai: PDF document, version 1.4

June 19, 2005, 11:54

Comment by Dave2:

I finally got around to trying this… thanks! The results are a bit mixed, and have me wishing that Adobe would add a flag to the metadata so that AI files which are saved with PDF compatibility can be separated out for indexing.

June 23, 2005, 16:02

Comment by ssp:

Exactly the point I’m trying to make… the existing ways of typing files are too coarse. So you either need Adobe to make their own importer or to have more finely grained file types.

I suppose the importer not being able to import some files won’t hurt too much, though.

June 23, 2005, 17:35

Comment by Mike A:

It is ironic that you recommend SubEthaEdit for this job, as SubEthaEdit files are currently saved in a manner that does not allow them to be indexed by Spotlight!! (if the SEE devs would allow you to add .txt to the end of each saved file, that would solve the problem) But for now I am still using SEE anyway…

July 7, 2005, 22:29

Comment by Sören Kuklau:

if the SEE devs would allow you to add .txt to the end of each saved file, that would solve the problem

What’s stopping you from saving a .txt file in SEE? I can’t remember ever having problems with that.

July 8, 2005, 0:56

Comment by blubbernaut:

I’m wondering how one goes about finding the UTI descriptor for files that mdimport regards as unknown, returning a dyn.blahblahnumbers.

I have a host of backedup .eml files dragged from Entourage, and have determined that if I change the type and creator to a rich text format and also change the extension to .txt, then Spotlight will search inside them, however then I can’t just double-click them to open automagically back in Entourage.

If I could add an .eml UTI to Spotlight’s Rich Text plugin, then potentially I could have it index all my .eml files on the fly.

February 16, 2006, 6:54

Comment by laurence:

Since I imported Entourage 11.3.3 on my MacBook running OS 10.4.8, I have lost Spotlight in Preferences → General. Do I need to do some coding to get it back?

Thanks very much for getting back to me.

NB: I know I am probably not in the right place, but I have looked everywhere and this is the closest I have come to my problem.

Any help will be much appreciated.

cheers

February 17, 2007, 10:33

Quarter Life Crisis

The world according to Sven-S. Porst