What characters are allowed unencoded in query strings?

A couple of months ago I advised people to Be careful with non-ascii characters in URLs. We’ve been discussing that at work lately, more specifically whether characters like “:” and “/” are allowed unencoded in query strings or not.

I may well have made mistakes trying to understand the specification, so I’d appreciate help spotting any errors in the following.

The summary of my previous post is this:

In essence this means that the only characters you can reliably use for the actual name parts of a URL are a-z, A-Z, 0-9, -, ., _, and ~. Any other characters need to be Percent encoded.

But what about those query strings? After studying RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax I’ve come to the following conclusions.

In section 2.2 Reserved Characters, the following characters are listed:

reserved = gen-delims / sub-delims

gen-delims = “:” / “/” / “?” / “#” / “[” / “]” / “@”

sub-delims = “!” / “$” / “&” / “'” / “(” / “)” / “*” / “+” / “,” / “;” / “=”

The spec then says:

If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

Next, in section 2.3 Unreserved Characters, the following are listed:

unreserved = ALPHA / DIGIT / “-” / “.” / “_” / “~”
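To make the difference between the two character classes concrete, here is a minimal Python sketch using `quote` from the standard library’s `urllib.parse`. It assumes Python 3.7 or later, where `quote` follows RFC 3986 and never encodes the unreserved characters. The helper name `encode_component` and the sample string are my own, purely for illustration.

```python
from urllib.parse import quote, unquote

# The unreserved set from RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def encode_component(text: str) -> str:
    """Percent-encode everything except unreserved characters (hypothetical helper)."""
    # safe="" stops quote() from treating "/" as safe, so only the
    # unreserved characters are left untouched.
    return quote(text, safe="")


sample = "a/b?c=d&e"  # illustrative data containing reserved characters

print(encode_component(UNRESERVED) == UNRESERVED)  # True: unreserved passes through
print(encode_component(sample))                    # a%2Fb%3Fc%3Dd%26e
print(unquote(encode_component(sample)))           # a/b?c=d&e
```

This is the “encode everything that is not unreserved” strategy. It is always safe, but as the sections below show, it encodes more than the generic syntax strictly requires.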

Ok, so let’s look at what is allowed in the path component of a URL. Section 3.3 Path has a bunch of rules that should be used by URI parsers. The last rule defines which characters are allowed:

pchar = unreserved / pct-encoded / sub-delims / “:” / “@”

Unreserved, percent-encoded, sub-delimiters, “:”, and “@”. Seems pretty clear.

What about the query component then? According to section 3.4 Query, these characters are allowed:

query = *( pchar / “/” / “?” )

Ok, so from the earlier definition of “pchar” we have unreserved, percent-encoded, sub-delimiters, “:”, and “@”. And for query strings “/” and “?” are allowed as well.

The conclusion is that something like http://example.com/document/?uri=http://user:password@example.com/?foo=bar is valid, since “/” and “?” do not need to be percent encoded in query strings, and neither do “:” and “@”.
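One way to double-check this reading is to transcribe the ABNF rules into regular expressions and run the example through them. The following is only a sketch of my interpretation of the `pchar` and `query` productions, not a full URI validator; the names `is_valid_segment` and `is_valid_query` are made up for this example.

```python
import re
from urllib.parse import urlsplit

# Character classes transcribed from RFC 3986 (sections 2.1, 2.2, 2.3).
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# segment = *pchar (one path segment, i.e. the text between two slashes)
SEGMENT_RE = re.compile(rf"{PCHAR}*")

# query = *( pchar / "/" / "?" )
QUERY_RE = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_segment(segment: str) -> bool:
    """True if the string matches the segment production (hypothetical helper)."""
    return SEGMENT_RE.fullmatch(segment) is not None


def is_valid_query(query: str) -> bool:
    """True if the string matches the query production (hypothetical helper)."""
    return QUERY_RE.fullmatch(query) is not None


url = "http://example.com/document/?uri=http://user:password@example.com/?foo=bar"
query = urlsplit(url).query

print(query)                         # uri=http://user:password@example.com/?foo=bar
print(is_valid_query(query))         # True
print(is_valid_query("a b"))         # False: a raw space must be percent-encoded
print(is_valid_segment("a/b"))       # False: "/" separates segments
```

Note that `urlsplit` splits at the first “?”, so everything after it, including the second “?”, ends up in the query component.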

Did I get it right? If not, a comment explaining where I’m mistaken would be much appreciated.

Posted on August 30, 2010 in Web Standards

Comments

  1. I’ve got a tvguide site and I have something like:

    /program/februari22010/channel/21:30+the+news

    Sometimes even a program like:

    /program/februari22010/channel/12:23+z@pp+sport

    I tested this in many browsers and have yet to find a browser that doesn’t support it. And Google indexes all those pages correctly.

    The only annoying thing is that when people post a link to my site on a bulletin board, it automatically cuts it off at the “:”. So I end up with visits to /program/februari22010/21

  2. My experience is that the underscore character in URLs, most notably in subdomain names, can cause all sorts of browser havoc; that’s probably because the _ character is NOT valid in domain names. WebKit, Gecko and Trident based browsers all refuse to hand over any cookies for subdomains with the underscore character in them. I’ve been bitten by this many a time.

    See this post at dns.net for more information on legal hostnames.

  3. This is correct for the generic URI syntax, but it’s worth noting that when the query string comes from an HTML form there are other requirements that come into play. Consider, for example, what happens if your values contain ‘&’: you’d end up with something like “?field=Jim&John&field2=value2”, which doesn’t make much sense.

    The rules for encoding key-value pairs into a query string (the content type is ‘application/x-www-form-urlencoded’) are given in the HTML 4.01 spec.

  4. Absolutely correct. However, notice that a lot of popular software does not always accept these correct URIs, and even generates incorrect URIs. For example, the “array syntax” used in query strings by so much software is actually invalid:

    http://example.com?abc[]=123&abc[]=456

    As you can see, the square brackets are listed in gen-delims and there’s no exception listed for them in the pchar or query production. Browsers are no better; they should encode them before sending them to the server, but I just tested with Firefox and it does not do this.

    So you should be careful in what URIs you generate, since there’s a whole lot of broken software out there. The forum example by Daan (comment #1) is a good one to serve as a warning.

    However, just encoding everything to be “safe” isn’t the way to go either, because I know there’s at least some software out there that doesn’t properly normalize URIs. Such software would treat a fully percent-encoded URI as a different URI from the maximally decoded version, even though the spec says they’re the same according to the normalization algorithm.

    As always, just test your code against the software you’re going to use. That’s the ultimate acceptance test :)

  5. August 31, 2010 by Roger Johansson (Author comment)

    Porges: Yep, one more thing to keep in mind.

    Peter:

    So you should be careful in what URIs you generate, since there’s a whole lot of broken software out there.

    Indeed. I prefer trying to stay on the safe side and avoid anything that can cause problems as much as possible.

  6. The spec seems rather hard to figure out. From reading the links posted I’m left wondering if using sub-delims is allowed in paths or not. Any ruling on that?

  7. September 2, 2010 by Roger Johansson (Author comment)

    Peter:

    From reading the links posted I’m left wondering if using sub-delims is allowed in paths or not.

    As far as I can tell, yes, they are.

  8. Interesting and timely. I had to redo some outbound links to a Spanish site last week because IE8 did not like the spaces and foreign language characters in the URL. What was surprising is that all the other major browsers, including IE7, had no complaints about the links.

  9. I found out the hard way, through a bug report for a URI library of mine, that the x-www-form-urlencoded spec in the HTML standard and the XForms part of XHTML actually require you to over-encode. They want you to encode anything except the unreserved characters, which is a major pain; there is common server software (Ruby on Rails in this case) that doesn’t allow less strictly encoded URI strings. So the URI spec actually allows more than the HTML standard does, which is rather odd. I also don’t quite understand the silly requirement that spaces be encoded as a ‘+’ sign. %20 works just fine for that!

    There was a draft RFC, never finalized, that tried to register a standard MIME type called application/www-form-urlencoded (so, without the x- prefix). It got rid of this pointless restriction and was more consistent with the URI spec. Too bad that it never really got pushed through the process! Even if we had to deal with legacy software for a while, the situation would be better in the long run.
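The form-encoding points raised in comments 3, 4 and 9 can be sketched with Python’s `urllib.parse.urlencode`, which builds application/x-www-form-urlencoded strings. This is just an illustration of how one standard library behaves, not a statement about what browsers or other frameworks do. The field names and values are taken from the comments above or invented for the example.

```python
from urllib.parse import parse_qs, quote, urlencode

# Comment 3: a value containing "&" has to be encoded, otherwise it would be
# read as a separator between key-value pairs.
fields = [("field", "Jim&John"), ("field2", "value2")]
form_query = urlencode(fields)
print(form_query)            # field=Jim%26John&field2=value2
print(parse_qs(form_query))  # {'field': ['Jim&John'], 'field2': ['value2']}

# Comment 9: by default spaces become "+"; passing quote_via=quote makes
# urlencode() emit "%20" instead.
plus_query = urlencode({"q": "the news"})
pct_query = urlencode({"q": "the news"}, quote_via=quote)
print(plus_query)            # q=the+news
print(pct_query)             # q=the%20news

# Comment 4: the square brackets of the "array syntax" are gen-delims,
# and urlencode() percent-encodes them.
array_query = urlencode([("abc[]", "123"), ("abc[]", "456")])
print(array_query)           # abc%5B%5D=123&abc%5B%5D=456
```

Both space encodings decode to the same value with `parse_qs`, which is one reason the “+” convention mostly goes unnoticed in practice.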
