Be careful with non-ascii characters in URLs

One thing that is rather common, especially on websites whose content is not in English, is URLs that contain unencoded characters such as space, å, ä, or ö. While this works most of the time it can cause problems.

Looking at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax, characters that are allowed to be used unencoded in URLs are either reserved or unreserved. The unreserved characters are a-z, A-Z, 0-9, -, ., _, and ~. The reserved characters are used as delimiters and are :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, and =.

In essence this means that the only characters you can reliably use for the actual name parts of a URL are a-z, A-Z, 0-9, -, ., _, and ~. Any other characters need to be Percent encoded.

So what does using an unencoded character like space or å in a URL mean in practice?

To keep things simple and predictable, consider sticking to the unreserved characters in URLs unless you have a really strong internationalisation requirement for using other characters.

This post is a Quick Tip. Background info is available in Quick Tips for web developers and web designers.

Posted on June 14, 2010 in Web Standards, Usability, Quick Tips, Browsers