Content negotiation

Note: There is a follow-up to this article: Content negotiation, AdSense, and comments.

As most of you probably know, XHTML should be served with a MIME type of application/xhtml+xml. Since not all browsers understand that MIME type, content negotiation can be used to send different MIME types to different browsers. Here’s what I ran into when I did that.

Since day one, this site has been marked up with valid XHTML 1.0 Strict. However, all documents have been served with the MIME type text/html, which is not the ideal way of serving XHTML; it should be served as application/xhtml+xml according to the (non-normative) W3C Note XHTML Media Types. There are different opinions on this: some argue that it’s OK to use text/html as long as the XHTML is “HTML compatible”, while others say that serving XHTML as text/html is completely invalid, evil, and almost worse than 20th century tag soup. I’m not going to take sides.

Anyway, in order to do the right thing I wanted to start using application/xhtml+xml. Since not all web browsers handle that MIME type, I started looking for ways of serving documents with different MIME types depending on the capabilities of the requesting user agent. After some googling I found a simple PHP script that lets me do that:

  <?php
  if (stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") || stristr($_SERVER["HTTP_USER_AGENT"], "W3C_Validator")) {
      header("Content-Type: application/xhtml+xml; charset=iso-8859-1");
      header("Vary: Accept");
      echo("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n");
  }
  else {
      header("Content-Type: text/html; charset=iso-8859-1");
      header("Vary: Accept");
  }
  ?>

Note that this is a simplified script which does not take the requesting user agent’s q-rating into account. Keep reading for more info on that.
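For a sense of what taking the q-rating into account involves, here is a rough sketch in PHP. This is not the script used on this site, and the function name and structure are my own, for illustration only:

```php
<?php
// A rough sketch, not the script used on this site: pick between
// application/xhtml+xml and text/html based on the q-values in the
// client's Accept header. The function name is mine, for illustration.
function preferred_type($accept) {
    $best_type = "text/html"; // safe fallback
    $best_q = 0.0;
    foreach (explode(",", $accept) as $part) {
        $pieces = explode(";", trim($part));
        $type = trim($pieces[0]);
        $q = 1.0; // a missing q parameter means q=1 per the HTTP spec
        for ($i = 1; $i < count($pieces); $i++) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)/', $pieces[$i], $m)) {
                $q = (float) $m[1];
            }
        }
        // Keep whichever of the two candidate types has the highest q-value
        if (($type == "application/xhtml+xml" || $type == "text/html") && $q > $best_q) {
            $best_type = $type;
            $best_q = $q;
        }
    }
    return $best_type;
}
```

A browser sending `application/xhtml+xml;q=0` would then correctly get text/html, which the simple stristr() check above cannot guarantee.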

You may be wondering why I’m using the character encoding iso-8859-1, and not UTF-8. Well, that’s on the to-do list. I’ve run into some weird problems trying to switch to UTF-8, but if I can just figure out how to do it correctly and it doesn’t create any serious compatibility problems, I’ll switch.

In order to make Apache run PHP scripts in .html files I added the following at the top of my .htaccess file:

  RemoveHandler .html .htm
  AddType application/x-httpd-php .html

I actually did that a long time ago, when I installed Refer, but in case your server isn’t configured to run PHP scripts in .html files, now you know how you can change that. If your server is running Apache, that is.

The script checks if the user agent sends an Accept HTTP header that contains the value “application/xhtml+xml”, or if the user agent is the W3C HTML Validator, which does not send a proper Accept HTTP header but still handles application/xhtml+xml. If either of those is true, the document is served as application/xhtml+xml. Those browsers are also sent an XML declaration. To other browsers, including all versions of Internet Explorer, the document is served as text/html. No XML declaration is added to the document, since that would put IE/Win into Quirks mode, and I don’t want that.

After the Content-Type header, a Vary header is sent to (if I understand it correctly) tell intermediate caches, like proxy servers, that the content type of the document varies depending on the capabilities of the client which requests the document. Thanks to Simon Jessey for the tip.

For a more advanced PHP content negotiation script, visit Serving up XHTML with the correct MIME type. That script takes the requesting user agent’s q-rating (how well it claims to handle a certain MIME type) into account, and converts XHTML to HTML 4 before sending it as text/html to user agents that don’t handle application/xhtml+xml.
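The conversion idea can be approximated with a single regular expression. The sketch below is not the Keystone Websites script itself, just an illustration of the principle; the function name is mine:

```php
<?php
// Sketch of the XHTML-to-HTML conversion idea (not the actual Keystone
// Websites script): strip the self-closing slashes that HTML 4 parsers
// don't expect, so "<br />" becomes "<br>". Function name is mine.
function xhtml_to_html($markup) {
    return preg_replace('/\s*\/>/', '>', $markup);
}
```

A complete conversion would have to handle more than empty elements (attributes like xml:lang, for instance), but this covers the most visible difference between the two serializations.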

If you’re using MS IIS instead of Apache, the following ASP script will do what the PHP script above does:

  <%
  If InStr(Request.ServerVariables("HTTP_ACCEPT"), "application/xhtml+xml") > 0 Or InStr(Request.ServerVariables("HTTP_USER_AGENT"), "W3C_Validator") > 0 Then
      Response.ContentType = "application/xhtml+xml"
      Response.Write("<?xml version=""1.0"" encoding=""iso-8859-1""?>" & VBCrLf)
  Else
      Response.ContentType = "text/html"
  End If
  Response.Charset = "iso-8859-1"
  Response.AddHeader "Vary", "Accept"
  %>

If you prefer ASP.NET, here’s a script provided by Justin Perkins:

  string http_accept = Request.ServerVariables["HTTP_ACCEPT"];
  string http_user_agent = Request.ServerVariables["HTTP_USER_AGENT"];
  if (((http_accept != null) && (http_accept.ToLower().IndexOf("application/xhtml+xml") > -1)) || ((http_user_agent != null) && (http_user_agent.ToLower().IndexOf("w3c_validator") > -1))) {
      Response.ContentType = "application/xhtml+xml";
      Response.Write("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n");
  }
  else {
      Response.ContentType = "text/html";
  }
  Response.Charset = "iso-8859-1";
  Response.AddHeader("Vary", "Accept");

One real-world benefit of sending XHTML documents as application/xhtml+xml is that Gecko-based browsers like Firefox and Mozilla will display an error message if there are any errors in the markup. No need to use the W3C validator for that. That benefit is also the only drawback: if a document is invalid, as can happen when XHTML is allowed in comments, Gecko browsers won’t display it at all until you notice the problem and edit the comment containing the invalid XHTML. Ouch.

Currently not all pages on this site use the PHP script above to determine which MIME type to use. There are two reasons:

  1. Comments. To avoid potential problems with invalid XHTML in comments I am still serving any pages that contain comments as text/html to all browsers. I’m going to leave it that way until I can wrap my head around a way to make sure comments are valid. I have found a Movable Type plugin called MTValidate, and Safe HTML checker, a script by Simon Willison, but like I said I can’t quite figure out how to use either of them. If anyone has a nice step-by-step tutorial on ensuring that comments are valid XHTML with Movable Type, please let me know.
  2. Google AdSense. I decided to have a go at displaying Google ads on the site. It works by adding a bit of JavaScript to each page that should display the ads. That code in itself is valid XHTML 1.0 Strict. Unfortunately, the JavaScript loads the ads by creating an <iframe>, which is not allowed in XHTML 1.0 Strict. Most browsers ignore that and happily display the <iframe> anyway, but browsers based on Gecko display nothing where the ads are supposed to be. I thought switching to XHTML 1.0 Transitional would fix that, since that DOCTYPE allows the <iframe> element, but no. I also had to go back to serving all documents that contain Google ads as text/html. Anyone know what’s going on with that? Am I wrong in thinking that XHTML 1.0 Transitional should allow transitional elements and attributes even when it is served as application/xhtml+xml? I haven’t looked at the AdSense JavaScript in detail, but I suppose the problem could be caused by invalid attributes in the <iframe> tag.

I’d like to solve those two problems and be able to put the content negotiation script on all pages, and any pointers in the right direction would be much appreciated.

Update: As if he was reading my mind, Ian Hickson posted Why document.write() doesn’t work in XML just a few days before I posted this. Since the Google AdSense script uses document.write() to create the <iframe>, it won’t work in documents served as application/xhtml+xml. Pretty obvious once you think about it. That leads to the next question: is there something that could be used instead of document.write(), and could be suggested to Google?

Update 2: I have updated the PHP content negotiation script to send a Vary header. Not sure how to do that with ASP, so I’d appreciate it if somebody could fill me in. I also added a link to the more advanced script at Keystone Websites.

Update 3: I have contacted Google AdSense tech support about the document.write + XHTML problem, and was informed that their engineers are currently working on a solution to the problem.

Update 4: Simon Jessey has come up with a workaround which is looking pretty good. An explanation can be found in Making AdSense work with XHTML.

Posted on August 8, 2004 in (X)HTML

Comments

  1. In Firefox you can right click the Google ad and select: This Frame > Open Frame in New Tab. You can then validate that page, which would tell you if that’s the problem.

    I’ve just done this, and it fails to validate. But should this stop the main page that calls the iframe from validating?

  2. Note that the document you are pointing to is in fact a W3C NOTE and not normative in any way (non-normative would be the appropriate term; you should not quote from it without mentioning that). According to RFC 3023 IIRC ‘application/xhtml+xml’ MUST be used.

  3. xHTML documents follow a different standard when using javascript. You will find that document.write statements will not work whatsoever.

    I’m not sure what the alternative is because I stopped dabbling with ECMAScript a while back because I prefer server-side scripting, so I’m rusty at it nowadays.

  4. August 9, 2004 by Roger (Author comment)

    Anne: Here are two quotes from the document I linked to:

    The ‘application/xhtml+xml’ media type [RFC3236] is the media type for XHTML Family document types, and in particular it is suitable for XHTML Host Language document types.

    and

    ‘application/xhtml+xml’ SHOULD be used for serving XHTML documents to XHTML user agents. Authors who wish to support both XHTML and HTML user agents MAY utilize content negotiation by serving HTML documents as ‘text/html’ and XHTML documents as ‘application/xhtml+xml’.

    According to that, sending XHTML documents as ‘text/html’ to HTML user agents is not OK, which I’m guessing is your standpoint. Other W3C documents state that it is OK if you follow their HTML Compatibility Guidelines.

    Like I said, there are different opinions on this.

    I skimmed through RFC 3023, but can’t find any mention of ‘application/xhtml+xml’. Could you provide some more details?

  5. August 9, 2004 by Roger (Author comment)

    Shunuk: That explains it. Looks like you have to choose between AdSense and application/xhtml+xml :(.

  6. I guess you’ll be interested in seeing Why document.write() doesn’t work in XML then…

  7. Hi, Roger. If you are going to be doing any kind of content negotiation, you really should say so in your header, with Vary: Accept. Furthermore, your current method of checking the Accept headers does not take into account the q-weighting given to each type. For more information, see Serving up XHTML with the correct MIME type - it describes a PHP solution that covers all bases that I’m aware of.

  8. August 9, 2004 by Roger (Author comment)

    Lolly: Thanks. That’s an excellent explanation of the problem. Does anyone know of a workaround that could be suggested to Google?

    Simon: I’ve seen that tutorial before, but it seemed a bit overkill to me at the time. I may reconsider that ;) The part about Vary: Accept is something I had missed though. I haven’t seen it in other articles about content negotiation. What exactly does it do?

  9. Roger, please note that that document is just a NOTE (I feel like I said this before) and it can’t represent the things you ought to do, since it isn’t a standard (thank god). Since it isn’t a standard you may not quote from it and say this is what you should do; that is just wrong.

    The Accept header can be used to say which type the client would prefer. Some people don’t want XHTML at all and use ‘application/xhtml+xml;q=0’, which means that they don’t want XHTML at any cost. Your scripts, however, will give it to them anyway and the page is borked.

    RFC 3236 doesn’t say anything either. It just defines the new media type, that’s all. Someone else addressed the Google Ads XHTML problem before, but with no luck. If they would use the DOM all would be fine, but no.

  10. The part about Vary: Accept is something I had missed though. I haven’t seen it in other articles about content negotiation. What exactly does it do?

    Basically, the Vary header informs the client (or an intermediate cache) that the content varies according to the client’s request - in this case, the contents of its Accept header. This is especially useful in caching situations, but otherwise it is just good manners. It doesn’t have to have it, but then you don’t have to have application/xhtml+xml either, so it seems appropriate to have one if you are going to have the other.

    The q-weighting thing probably is overkill, but I thought it was necessary to be thorough. It will probably come into its own with mobile devices and other esoteric user agents.

  11. August 9, 2004 by Roger (Author comment)

    Anne: I added a note about the note being a note. I actually failed to note that it was a note until now ;)

    Thanks for the link to Google AdSense and XHTML1.1. I’ll contact Google about it as well.

    Simon: Thanks for the explanation.

  12. Well, I’m in the same hole as well so I said “aww F**k it, I’m doing it myself.” Obviously Google’s gonna get pissed at me, but I’m rewriting their engine so it works with ALL websites (= more money for them). I’m writing the DOM code (with little JavaScript or XML knowledge, but enough to find out and learn!), and if anyone wants to help, email me!

  13. I’ve just figured out how to do the AdSense thing, but it isn’t very elegant. See AdSense for XHTML for details.

  14. August 22, 2004 by Roger (Author comment)

    That’s a nice trick, Simon. After reading your post I experimented a little, and found that you can use PHP to send the object element only to the same browsers that get application/xhtml+xml, and the normal JavaScript AdSense code only to text/html browsers. Seems like it works, and that removes the need for CSS tricks to hide one set of ads in some browsers.

    I wonder if the AdSense code likes being inserted via an object element though. Well, I guess one way of finding out is to try it :)

  15. A slight problem with your content negotiation script and Safari (Jaguar). If I add a normal link to certain websites and use Safari’s back-button to, uh, get back then the linking page, which is written in php/xhtml, breaks up! Safari says that there is a parse error. I haven’t yet figured out why, how or what happens.

  16. Correction to my earlier comment:

    “If I add a normal link to certain websites and use Safari’s back-button to..”

    Should be:

    “If I add a normal link to certain website and click the link in Safari and use Safari’s back-button to…”

    But I guess you caught it already. :)

  17. August 22, 2004 by Roger (Author comment)

    That Safari thing is not something I have noticed or heard of. It sounds very strange to me.

  18. Tell me about it! I tried out the Keystone Websites’ content negotiation script and everything is fine and dandy. I’m just guessing, but maybe there was something completely wrong with the other site I linked and Safari just freaked out or something? Or maybe my code is crappy? Questions, questions…

  19. Just a quick note that it’s starting to look like the problem may lie between Safari and the WordPress install I’m using now, NOT your script. Sorry for the inconvenience. :)

  20. This is a fairly common problem, and one that won’t go away any time soon.

    It’s not helped by the HTTP_ACCEPT specs being complicated and not properly followed by UAs.

    In fact, an HTTP_ACCEPT header can contain “application/xhtml+xml”, but be followed by “;q=0.01”, as you explained. As the alternative is a “text/” type as opposed to an “application/” type, this makes it very hard, almost impossible, to judge what the user agent would rather receive with any degree of accuracy.

    Also, a user agent can, according to the specs, send alternative HTTP_ACCEPT headers, like “application/*” (meaning that it could accept “application/xhtml+xml” without a problem), or even “*/*” (as IE does because MS are too lazy to do things properly), which indicates any MIME type is OK.

    And unfortunately, the above means that:

    1. Many user agents that can receive and handle “application/xhtml+xml” properly will be served “text/html”.

    2. Many user agents that can just about receive and comprehend “application/xhtml+xml” but would prefer “text/html” are served “application/xhtml+xml”.

    3. The most-used user agent (IE) lies about what it can and cannot receive and process properly.

    Which makes serving the correct mime types tricky at best. Personally, I’m working on a new version of my site. I use something similar to the above, with an IE-sniffer (because IE doesn’t send proper headers), and a small amount of “q” processing, but only very basic.

    I’m starting to think “text/plain” is the way to go…

  21. November 12, 2004 by matthijs

    Hi, just a question about this application versus text thing: why is it such a big problem to serve XHTML as text? I know you probably don’t want to repeat the discussion here, but maybe a link to a place where this problem is discussed and explained would be helpful for someone like me who doesn’t understand it! For what it’s worth, I just picked up the Zeldman book about making websites (DWWS) and learned that using the XML prolog causes a lot of problems. With the current browsers, anyway.

    Thanx for a reply, Matthijs, Holland

  22. November 18, 2004 by Steffan

    Without going so far as to worry about the possibility that UAs might use alternative HTTP_ACCEPT headers the following might prove useful:

    Dim objServerVariableRegExp, arrHttpAccept, i, lngHttpAcceptQuality(3)
    Set objServerVariableRegExp = New RegExp
    objServerVariableRegExp.Global = True
    objServerVariableRegExp.Pattern = "^(.*?)((;q=)(.*))?$"

    arrHttpAccept = Split(Request.ServerVariables("HTTP_ACCEPT"), ",")

    For i = 0 To UBound(arrHttpAccept)
        lngHttpAcceptQuality(0) = objServerVariableRegExp.Replace(arrHttpAccept(i), "$4")
        If lngHttpAcceptQuality(0) = "" Then
            lngHttpAcceptQuality(0) = 1
        Else
            lngHttpAcceptQuality(0) = CSng(lngHttpAcceptQuality(0))
        End If
        Select Case objServerVariableRegExp.Replace(arrHttpAccept(i), "$1")
            Case "application/xhtml+xml"
                lngHttpAcceptQuality(1) = lngHttpAcceptQuality(0)
            Case "text/html"
                lngHttpAcceptQuality(2) = lngHttpAcceptQuality(0)
            Case "*/*"
                lngHttpAcceptQuality(3) = lngHttpAcceptQuality(0)
        End Select
    Next

    If (lngHttpAcceptQuality(1) > lngHttpAcceptQuality(2) And lngHttpAcceptQuality(1) > lngHttpAcceptQuality(3)) Or InStr(Request.ServerVariables("HTTP_USER_AGENT"), "W3C_Validator") > 0 Then
        Response.ContentType = "application/xhtml+xml"
        Response.Write "" & vbCrLf
    Else
        Response.ContentType = "text/html"
    End If

    Response.Charset = "iso-8859-1"
    Response.AddHeader "Vary", "Accept"

    Hope you either find this useful or can point out any flaws in my approach.

  23. November 18, 2004 by Roger Johansson (Author comment)

    Matthijs: As you may have noticed, opinions vary on whether it actually is a problem or not. I’m hesitant to take a stand either way, so here are a couple of places to read more:

    Steffan: I haven’t tried the script but it looks interesting. Any ideas on how you could use ASP to convert XHTML to HTML 4 on the fly, similar to the PHP script described by Tommy Olsson in Content Negotiation?

  24. November 19, 2004 by Steffan

    Roger: I don’t think you can actually edit the response buffer in asp but the following will work:

    Having set Response.ContentType, you can build IF or SELECT statements to determine which content type is in force whenever you have blocks to handle differently (e.g. commenting with embedded css).

    In order to strip the forward slashes from empty elements you could load everything into a variable as you go rather than using response.write (i.e. strHTMLstream = strHTMLStream & “…”) then use a regex.replace on your variable before using response.write to pass it to the UA.

    This can apparently have a negative impact on performance though [although I haven’t tested it myself], as outlined in An Efficient String Concatenation Component.

  25. December 13, 2004 by Steffan

    Just noticed that part of the script didn’t display - the last part should have read:

    If (lngHttpAcceptQuality(1) > lngHttpAcceptQuality(2) And lngHttpAcceptQuality(1) > lngHttpAcceptQuality(3)) Or InStr(Request.ServerVariables("HTTP_USER_AGENT"), "W3C_Validator") > 0 Then
        Response.ContentType = "application/xhtml+xml"
        Response.Write "<?xml version=""1.0"" encoding=""iso-8859-1""?>" & vbCrLf
    Else
        Response.ContentType = "text/html"
    End If

    Response.Charset = "iso-8859-1"
    Response.AddHeader "Vary", "Accept"

    With regard to concatenating the page - I haven’t stripped forward slashes in this way (yet), but I have previously processed a number of large blocks to generate titles and links for words in the glossary on a site without any major slowdown.

    When I get a chance I’ll give this a try and speed test some static markup against server side generated stuff to see how bad the problem is on a reasonable length page.

Comments are disabled for this post (read why), but if you have spotted an error or have additional info that you think should be in this post, feel free to contact me.