What characters are allowed unencoded in query strings?

A couple of months ago I advised people to “Be careful with non-ascii characters in URLs”. We’ve been discussing that at work lately, specifically whether characters like “:” and “/” may appear unencoded in query strings.

I may well have misread parts of the specification, so corrections to anything in the following would be appreciated.

The summary of my previous post is this:

In essence this means that the only characters you can reliably use for the actual name parts of a URL are a-z, A-Z, 0-9, -, ., _, and ~. Any other characters need to be percent-encoded.

But what about those query strings? After studying RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax I’ve come to the following conclusions.

In section 2.2 Reserved Characters, the following characters are listed:

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The spec then says:

If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

Next, in section 2.3 Unreserved Characters, the following are listed:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
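To make the unreserved set concrete, here is a minimal sketch using Python’s standard urllib.parse.quote. The helper name encode_component is my own, and the sketch assumes Python 3.7 or later, where quote() leaves “~” alone. Passing safe="" matters because quote() otherwise leaves “/” unencoded by default.

```python
from urllib.parse import quote

# The unreserved set from RFC 3986 section 2.3, spelled out.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"
)


def encode_component(text: str) -> str:
    """Percent-encode everything except unreserved characters.

    Hypothetical helper for illustration. safe="" overrides quote()'s
    default of treating "/" as safe.
    """
    return quote(text, safe="")


print(encode_component(UNRESERVED) == UNRESERVED)  # unreserved passes through
print(encode_component("a/b:c"))                    # reserved characters get encoded
print(encode_component("café"))                     # non-ASCII is encoded as UTF-8 bytes
```

This encodes more than the query grammar strictly requires, which is the cautious choice when you do not know where the value will end up.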

Ok, so let’s look at what is allowed in the path component of a URL. Section 3.3 Path gives a set of grammar rules that describe how a path is built up. The last rule defines which characters are allowed in a path segment:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

Unreserved, percent-encoded, sub-delimiters, “:”, and “@”. Seems pretty clear.

What about the query component then? According to section 3.4 Query, these characters are allowed:

query = *( pchar / "/" / "?" )

Ok, so from the earlier definition of “pchar” we have unreserved, percent-encoded, sub-delimiters, “:”, and “@”. And for query strings “/” and “?” are allowed as well.
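The grammar rules quoted so far can be transcribed more or less directly into a regular expression. The following is a rough sketch of my reading of them, not a complete URI validator. The names QUERY_RE and is_valid_query are made up for this example.

```python
import re

# Character classes transcribed from RFC 3986 sections 2.2 and 2.3.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# query = *( pchar / "/" / "?" )
QUERY_RE = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_query(query: str) -> bool:
    """Return True if the string (without the leading "?") matches the query rule."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("a=1&b=2"))       # sub-delims are allowed
print(is_valid_query("path=/a/b?c:d")) # so are "/", "?" and ":"
print(is_valid_query("a b"))           # a raw space is not
print(is_valid_query("a#b"))           # "#" would start the fragment
print(is_valid_query("a[0]=1"))        # "[" and "]" are gen-delims outside the rule
```

Note that this only checks syntax. It says nothing about whether a given server or framework will interpret the characters the way you intend.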

The conclusion is that something like http://example.com/document/?uri=http://user:password@example.com/?foo=bar is valid, since “/” and “?” do not need to be percent-encoded in query strings, and neither do “:” and “@”.
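One way to sanity-check this is to see how a parser splits the example URL. The sketch below uses Python’s urllib.parse. It only shows how one particular parser treats the string, so take it as supporting evidence rather than proof of validity under the RFC.

```python
from urllib.parse import parse_qs, urlsplit

URL = "http://example.com/document/?uri=http://user:password@example.com/?foo=bar"

# Split into scheme, netloc, path, query and fragment.
parts = urlsplit(URL)

# Split the query into name/value pairs on "&" and the first "=".
params = parse_qs(parts.query)

print(parts.netloc)  # host of the outer URL only
print(parts.path)
print(parts.query)   # everything after the first "?"
print(params)
```

The outer host comes out as example.com, and the whole inner URL ends up as the value of uri, including its own “?” and “=”. Had the inner URL contained an “&”, parse_qs would have treated it as a separator between parameters. That is the situation the sentence quoted from section 2.2 describes, and there the “&” would have to be written as %26.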

Did I get it right? If not, a comment explaining where I’m mistaken would be much appreciated.

Posted on August 30, 2010 in Web Standards