ASCII filename normalization

Sandvox uses file packages for its document format. Each bit of media required by the document is stored on disk as its own file, inside the package.

When adding media, we try to respect its existing filename. Historically, this has simply involved ensuring it’s unique (and if not, appending a number on the end until it is). Granted, we could simply store media by something like a UUID or hash. But it’s been nice to keep the names as human-friendly as possible for debugging purposes, and when dragging media back out of the app.
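
To make the uniqueness step concrete, here is a minimal sketch of that kind of naming logic. It is not Sandvox’s actual code: the function name `uniqueFilename(for:in:)` and the `-2`, `-3` suffix style are assumptions made up for illustration.

```swift
import Foundation

// Hypothetical helper (not from Sandvox): returns `preferred` if nothing in
// `directory` already uses that name, otherwise tries "name-2.ext",
// "name-3.ext", and so on until a free name turns up.
func uniqueFilename(for preferred: String, in directory: URL) -> String {
    let fileManager = FileManager.default
    let original = preferred as NSString
    let base = original.deletingPathExtension   // "photo" from "photo.jpg"
    let ext = original.pathExtension            // "jpg" from "photo.jpg"

    var candidate = preferred
    var counter = 2
    while fileManager.fileExists(atPath: directory.appendingPathComponent(candidate).path) {
        // Put the number before the extension so the file type survives.
        candidate = ext.isEmpty ? "\(base)-\(counter)" : "\(base)-\(counter).\(ext)"
        counter += 1
    }
    return candidate
}
```

Calling `uniqueFilename(for: "photo.jpg", in: packageURL)`, where `packageURL` stands in for wherever the package keeps its media, hands back the original name whenever it is free, which is what keeps things human-friendly.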

Recently, though, I had a case which exposed a bit of a flaw in our system. Should a file or directory be transferred to a filesystem which doesn’t support the full range of Unicode characters, it seems OS X will adjust filenames as needed to suit the target disk. This is great for regular folders, but pretty bad for file packages in apps like Sandvox!

If a document gets transferred and contains files whose names need adjusting, Sandvox is then left unable to locate the media files after their rename. What a vexing problem! I think iWork’s switch to a zip-based document format around 2008 or so makes quite a bit of sense in this light!

It seemed to me the best solution available to us is to do a little more work upfront and assign the most broadly compatible filenames available to new media, i.e. make them pure ASCII. But how to do this neatly? Here we go:
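
Something along these lines, as a minimal sketch rather than the exact code Sandvox ships. The function name `asciiFilename(from:)` is hypothetical, and the transform ID `"Any-Latin; Latin-ASCII"` is an assumption: both parts are standard ICU transform IDs, but they are not necessarily the ones used in the app.

```swift
import Foundation

// Hypothetical helper (not Sandvox's actual code): transliterate a filename
// towards plain ASCII with a single CFStringTransform call.
func asciiFilename(from name: String) -> String {
    // CFStringTransform edits its argument in place, so work on a mutable copy.
    let mutable = NSMutableString(string: name)

    // Two ICU transform IDs chained with a semicolon (assumed IDs):
    // "Any-Latin" transliterates other scripts into Latin characters,
    // "Latin-ASCII" then folds accented Latin characters down to ASCII.
    let transformID = "Any-Latin; Latin-ASCII" as CFString

    // CFStringTransform returns false if the transform could not be applied;
    // in that case hand back the original name untouched.
    guard CFStringTransform(mutable as CFMutableString, nil, transformID, false) else {
        return name
    }
    return mutable as String
}
```

Not every character has a transliteration, so it is probably worth checking that the result really is all ASCII before relying on it.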

Yep, gotta love a bit of CFStringTransform. CoreFoundation makes available constants to do what we want (kCFStringTransformToLatin and kCFStringTransformStripCombiningMarks). But the docs also have this to say:

On OS X v10.4 and later, you can also use any valid ICU transform ID defined in the ICU User Guide for Transforms.

Happily, a reader reports this feature applies to iOS too. And so we turn to the ICU user guide, and discover the raw transform IDs, and — nicely — that we can chain them together (using a semicolon to separate transform names).

I don’t know if this is actually more efficient than making two separate calls, but it’s less code, and nicely gives Apple a little more context should they optimise more in the future. Update: @geheimwek tells me that he’s found custom transforms to be pretty slow, so you should probably test both approaches if you have a performance-sensitive situation.
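
If you do want to compare the two approaches, a rough harness might look like the sketch below. Everything in it is illustrative: `applySeparately`, `measure`, the sample string and the iteration count are all made up, and the chained ID `"Any-Latin; Latin-ASCII"` is an assumed stand-in that is not guaranteed to produce the same output as the two built-in constants.

```swift
import Foundation

// Hypothetical helper: apply each transform as its own CFStringTransform call.
func applySeparately(_ transforms: [CFString], to name: String) -> String {
    let mutable = NSMutableString(string: name)
    for transform in transforms {
        // Give up and return the input if any step cannot be applied.
        guard CFStringTransform(mutable as CFMutableString, nil, transform, false) else {
            return name
        }
    }
    return mutable as String
}

// Very rough wall-clock timing; fine for a sanity check, not a real benchmark.
func measure(_ label: String, iterations: Int = 10_000, _ body: () -> Void) {
    let start = Date()
    for _ in 0..<iterations { body() }
    print("\(label): \(Date().timeIntervalSince(start))s")
}

let sample = "Ünïcödé filename.png"   // made-up test input

measure("two separate calls") {
    _ = applySeparately([kCFStringTransformToLatin,
                         kCFStringTransformStripCombiningMarks], to: sample)
}
measure("one chained ICU ID") {
    _ = applySeparately(["Any-Latin; Latin-ASCII" as CFString], to: sample)
}
```

Run it against filenames representative of your own data; the numbers will depend heavily on the OS version and on the input.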

© Mike Abdullah 2007-2015