Escaping URL host names in Cocoa

Having covered URL paths and queries, it's time to wind back a little and take a look at the host component of a URL:

scheme://host/path


Happily for us, the host is generally a domain name, something along the lines of example.com. Literal IPv4 addresses aren't entirely uncommon either (e.g. 1.2.3.4). Such strings are fine as-is without any escaping. Hooray, job done, time to go home.


But if you're feeling more adventurous, you might wish to encode arbitrary host names containing characters outside of the standard ASCII set. This is most likely for a scheme other than those commonly found on the web (http etc.). For that, we turn again to our old friend percent-encoding.

Starting with the more obvious, the characters /, ?, and # must be escaped to prevent anything that follows them from being mistaken for a path, query or fragment. The : character is in a similar position, since it could be mistaken for the start of a port component. And finally, @ symbols must be escaped so that none of the host is mistaken for a username or password.
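As a rough sketch of that rule in Swift: the helper name escapeHost is made up for illustration, and the allowed set is my own deliberately conservative assumption (RFC 3986 "unreserved" characters only) rather than anything Foundation provides for hosts. It escapes more than strictly necessary, but that guarantees /, ?, #, : and @ never slip through.

```swift
import Foundation

// Hypothetical helper: percent-encodes a host name, leaving only the
// RFC 3986 "unreserved" characters untouched. This is stricter than the
// RFC requires (sub-delims such as ! or + are also legal in a host),
// but it ensures / ? # : and @ are always escaped.
func escapeHost(_ host: String) -> String {
    let unreserved = CharacterSet(charactersIn:
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
    // Every character outside the allowed set is replaced by the
    // percent-encoded form of its UTF-8 bytes. The call returns nil only
    // if the string cannot be converted, so fall back to the input.
    return host.addingPercentEncoding(withAllowedCharacters: unreserved) ?? host
}

print(escapeHost("example.com"))     // example.com
print(escapeHost("user@my/host:1"))  // user%40my%2Fhost%3A1
```

Plain domain names pass through unchanged, which matches the "job done" case above.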

IP Literals

Now it starts to get complicated. Sometimes IPv6 addresses need to be represented in a URL. Or indeed there may be some future address scheme that needs to be accounted for too. RFC 3986 sets aside a literal syntax for this:

http://[::FFFF:129.144.52.38]:80/index.html


The square brackets (which would normally be percent encoded elsewhere in a URL) indicate this special syntax.

Most importantly, note the use of : characters in the example above. One is used after the brackets to indicate the port number. But what about those within the brackets? As we saw above, : characters must usually be escaped as part of a hostname, but for literal addresses, the brackets act as an exception to this rule.

Thus a truly general-purpose algorithm for encoding URL host names needs to special-case address literals. Too tricky to be worth a Gist, in my opinion. Instead, use -[KSURLComponents setHost:] and those nasty details will be taken care of for you.
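Just to give a flavour of the special-casing, here is a heavily simplified Swift sketch. It is not how KSURLComponents works internally (I'm not showing its source here). The function name is invented, and it assumes anything wrapped in square brackets is a well-formed literal, with no validation of what lies between them. A real implementation would have to check that content properly, which is exactly the tricky part.

```swift
import Foundation

// Hypothetical helper: escapes a host name, but passes bracketed IP
// literals such as [::1] through untouched.
// ASSUMPTION: the content between the brackets is NOT validated.
func escapeHostAllowingLiterals(_ host: String) -> String {
    // Treat "[...]" with something between the brackets as an IP literal.
    if host.count > 2, host.hasPrefix("["), host.hasSuffix("]") {
        return host
    }
    // Otherwise escape everything except RFC 3986 "unreserved" characters
    // (a conservative choice made for this sketch).
    let unreserved = CharacterSet(charactersIn:
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
    return host.addingPercentEncoding(withAllowedCharacters: unreserved) ?? host
}

print(escapeHostAllowingLiterals("[::FFFF:129.144.52.38]"))  // [::FFFF:129.144.52.38]
print(escapeHostAllowingLiterals("my:host"))                 // my%3Ahost
```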

International Domain Names

For further complication, websites can make use of international domain names, whereby characters outside of the regular ASCII set are encoded using a system known as Punycode. For example, the domain exämple.com is actually addressed as xn--exmple-cua.com. Web browsers take care of hiding this detail from users by presenting (and accepting) the Unicode form of such domains.
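To make the exämple.com example concrete, here is a sketch of the Punycode encoding step from RFC 3492 in Swift. The function names are my own, and it makes some loud assumptions: no Unicode normalisation or case mapping is applied first (real IDNA processing requires both), and the overflow checks from the RFC are left out. Treat it as an illustration of the encoding, not something to ship.

```swift
// Sketch of the RFC 3492 Punycode encoder for a single domain label.
// ASSUMPTIONS: input is already normalised; no overflow checks.
func punycodeEncode(_ label: String) -> String {
    let base = 36, tMin = 1, tMax = 26, skew = 38, damp = 700
    let input = label.unicodeScalars.map { Int($0.value) }

    // Maps a digit value 0...35 to "a"..."z", "0"..."9".
    func digit(_ d: Int) -> Character {
        return Character(UnicodeScalar(UInt8(d < 26 ? d + 97 : d - 26 + 48)))
    }
    // Bias adaptation (RFC 3492, section 6.1).
    func adapt(_ delta: Int, _ numPoints: Int, _ first: Bool) -> Int {
        var d = first ? delta / damp : delta / 2
        d += d / numPoints
        var k = 0
        while d > ((base - tMin) * tMax) / 2 {
            d /= base - tMin
            k += base
        }
        return k + ((base - tMin + 1) * d) / (d + skew)
    }

    // Copy the basic (ASCII) code points first, followed by a delimiter.
    let basic = input.filter { $0 < 128 }
    var output = String(basic.map { Character(UnicodeScalar(UInt8($0))) })
    if !basic.isEmpty { output.append("-") }

    var n = 128, delta = 0, bias = 72, handled = basic.count
    while handled < input.count {
        // Smallest code point that has not been encoded yet.
        let m = input.filter { $0 >= n }.min()!
        delta += (m - n) * (handled + 1)
        n = m
        for c in input {
            if c < n { delta += 1 }
            if c == n {
                // Emit delta as a variable-length integer.
                var q = delta, k = base
                while true {
                    let t = k <= bias ? tMin : (k >= bias + tMax ? tMax : k - bias)
                    if q < t { break }
                    output.append(digit(t + (q - t) % (base - t)))
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(digit(q))
                bias = adapt(delta, handled + 1, handled == basic.count)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return output
}

// Converts a host label by label, adding the "xn--" prefix only to
// labels that contain non-ASCII characters.
func asciiHost(_ host: String) -> String {
    return host.split(separator: ".", omittingEmptySubsequences: false).map { part -> String in
        let label = String(part)
        let isASCII = label.unicodeScalars.allSatisfy { $0.isASCII }
        return isASCII ? label : "xn--" + punycodeEncode(label)
    }.joined(separator: ".")
}

print(asciiHost("ex\u{E4}mple.com"))  // xn--exmple-cua.com
```

The ä is written as \u{E4} so that the precomposed form is encoded regardless of how the source file happens to be saved.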

Handling this encoding is a tricky business, but fortunately one we probably only face when accepting user input of URLs, or formatting them for display. Rather than attempt to write my own, there are a few options available, which I’ve blogged about before.

© Mike Abdullah 2007-2015