Content negotiation

Note: There is a follow-up to this article: Content negotiation, AdSense, and comments.

As most of you probably know, XHTML should be served with a MIME type of application/xhtml+xml. Since not all browsers understand that MIME type, content negotiation can be used to send different MIME types to different browsers. Here’s what I ran into when I did that.

Since day one, this site has been marked up with valid XHTML 1.0 Strict. However, all documents have been served with the MIME type text/html, which is not the ideal way of serving XHTML; it should be served as application/xhtml+xml according to the (non-normative) W3C Note XHTML Media Types. There are different opinions on this: some argue that it’s OK to use text/html as long as the XHTML is “HTML compatible”, while others say that serving XHTML as text/html is completely invalid, evil, and almost worse than 20th century tag soup. I’m not going to take sides.

Anyway, in order to do the right thing I wanted to start using application/xhtml+xml. Since not all web browsers handle that MIME type, I started looking for ways of serving documents with different MIME types depending on the capabilities of the requesting user agent. After some googling I found a simple PHP script that lets me do that:

  <?php
  if (stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") || stristr($_SERVER["HTTP_USER_AGENT"], "W3C_Validator")) {
      header("Content-Type: application/xhtml+xml; charset=iso-8859-1");
      header("Vary: Accept");
      echo("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n");
  }
  else {
      header("Content-Type: text/html; charset=iso-8859-1");
      header("Vary: Accept");
  }
  ?>

Note that this is a simplified script which does not take the requesting user agent’s q-rating into account. Keep reading for more info on that.
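For a sense of what taking the q-rating into account involves, here is a rough sketch in PHP. This is not the script used on this site, and the function name and structure are my own, for illustration only:

```php
<?php
// A rough sketch, not the script used on this site: pick between
// application/xhtml+xml and text/html based on the q-values in the
// client's Accept header. The function name is mine, for illustration.
function preferred_type($accept) {
    $best_type = "text/html"; // safe fallback
    $best_q = 0.0;
    foreach (explode(",", $accept) as $part) {
        $pieces = explode(";", trim($part));
        $type = trim($pieces[0]);
        $q = 1.0; // a missing q parameter means q=1 per the HTTP spec
        for ($i = 1; $i < count($pieces); $i++) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)/', $pieces[$i], $m)) {
                $q = (float) $m[1];
            }
        }
        // Keep whichever of the two candidate types has the highest q-value
        if (($type == "application/xhtml+xml" || $type == "text/html") && $q > $best_q) {
            $best_type = $type;
            $best_q = $q;
        }
    }
    return $best_type;
}
```

A browser sending `application/xhtml+xml;q=0` would then correctly get text/html, which the simple stristr() check above cannot guarantee.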

You may be wondering why I’m using the character encoding iso-8859-1, and not UTF-8. Well, that’s on the to-do list. I’ve run into some weird problems trying to switch to UTF-8, but if I can just figure out how to do it correctly and it doesn’t create any serious compatibility problems, I’ll switch.

In order to make Apache run PHP scripts in .html files I added the following at the top of my .htaccess file:

  RemoveHandler .html .htm
  AddType application/x-httpd-php .html

I actually did that a long time ago, when I installed Refer, but in case your server isn’t configured to run PHP scripts in .html files, now you know how you can change that. If your server is running Apache, that is.

The script checks if the user agent sends an Accept HTTP header that contains the value “application/xhtml+xml”, or if the user agent is the W3C HTML Validator, which does not send a proper Accept HTTP header but still handles application/xhtml+xml. If either of those is true, the document is served as application/xhtml+xml. Those browsers are also sent an XML declaration. To other browsers, including all versions of Internet Explorer, the document is served as text/html. No XML declaration is added to the document, since that would put IE/Win into Quirks mode, and I don’t want that.

After the Content-Type header, a Vary header is sent to (if I understand it correctly) tell intermediate caches, like proxy servers, that the content type of the document varies depending on the capabilities of the client which requests the document. Thanks to Simon Jessey for the tip.

For a more advanced PHP content negotiation script, visit Serving up XHTML with the correct MIME type. That script takes the requesting user agent’s q-rating (how well it claims to handle a certain MIME type) into account, and converts XHTML to HTML 4 before sending it as text/html to user agents that don’t handle application/xhtml+xml.
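The conversion idea can be approximated with a single regular expression. The sketch below is not the Keystone Websites script itself, just an illustration of the principle; the function name is mine:

```php
<?php
// Sketch of the XHTML-to-HTML conversion idea (not the actual Keystone
// Websites script): strip the self-closing slashes that HTML 4 parsers
// don't expect, so "<br />" becomes "<br>". Function name is mine.
function xhtml_to_html($markup) {
    return preg_replace('/\s*\/>/', '>', $markup);
}
```

A complete conversion would have to handle more than empty elements (attributes like xml:lang, for instance), but this covers the most visible difference between the two serializations.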

If you’re using MS IIS instead of Apache, the following ASP script will do what the PHP script above does:

  <%
  If InStr(Request.ServerVariables("HTTP_ACCEPT"), "application/xhtml+xml") > 0 Or InStr(Request.ServerVariables("HTTP_USER_AGENT"), "W3C_Validator") > 0 Then
      Response.ContentType = "application/xhtml+xml"
      Response.Write("<?xml version=""1.0"" encoding=""iso-8859-1""?>" & VBCrLf)
  Else
      Response.ContentType = "text/html"
  End If
  Response.Charset = "iso-8859-1"
  Response.AddHeader "Vary", "Accept"
  %>

If you prefer ASP.NET, here’s a script provided by Justin Perkins:

  string http_accept = Request.ServerVariables["HTTP_ACCEPT"];
  string http_user_agent = Request.ServerVariables["HTTP_USER_AGENT"];
  if (((http_accept != null) && (http_accept.ToLower().IndexOf("application/xhtml+xml") > -1)) || ((http_user_agent != null) && (http_user_agent.ToLower().IndexOf("w3c_validator") > -1))) {
      Response.ContentType = "application/xhtml+xml";
      Response.Write("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n");
  }
  else {
      Response.ContentType = "text/html";
  }
  Response.Charset = "iso-8859-1";
  Response.AddHeader("Vary", "Accept");

One real-world benefit of sending XHTML documents as application/xhtml+xml is that Gecko-based browsers like Firefox and Mozilla will display an error message if there are any errors in the markup. No need to use the W3C validator for that. That benefit is also the only drawback: if a document is invalid, as can happen when XHTML is allowed in comments, Gecko browsers won’t display it at all until you notice the problem and edit the comment containing the invalid XHTML. Ouch.

Currently not all pages on this site use the PHP script above to determine which MIME type to use. There are two reasons:

  1. Comments. To avoid potential problems with invalid XHTML in comments I am still serving any pages that contain comments as text/html to all browsers. I’m going to leave it that way until I can wrap my head around a way to make sure comments are valid. I have found a Movable Type plugin called MTValidate, and Safe HTML checker, a script by Simon Willison, but like I said I can’t quite figure out how to use either of them. If anyone has a nice step-by-step tutorial on ensuring that comments are valid XHTML with Movable Type, please let me know.
  2. Google AdSense. I decided to have a go at displaying Google ads on the site. It works by adding a bit of JavaScript to each page that should display the ads. That code in itself is valid XHTML 1.0 Strict. Unfortunately, the JavaScript loads the ads by creating an <iframe>, which is not allowed in XHTML 1.0 Strict. Most browsers ignore that and happily display the <iframe> anyway, but browsers based on Gecko display nothing where the ads are supposed to be. I thought switching to XHTML 1.0 Transitional would fix that, since that DOCTYPE allows the <iframe> element, but no. I also had to go back to serving all documents that contain Google ads as text/html. Anyone know what’s going on with that? Am I wrong in thinking that XHTML 1.0 Transitional should allow transitional elements and attributes even when it is served as application/xhtml+xml? I haven’t looked at the AdSense JavaScript in detail, but I suppose the problem could be caused by invalid attributes in the <iframe> tag.

I’d like to solve those two problems and be able to put the content negotiation script on all pages, and any pointers in the right direction would be much appreciated.

Update: As if he was reading my mind, Ian Hickson posted Why document.write() doesn’t work in XML just a few days before I posted this. Since the Google AdSense script uses document.write() to create the <iframe>, it won’t work in documents served as application/xhtml+xml. Pretty obvious once you think about it. That leads to the next question: is there something that could be used instead of document.write(), and could be suggested to Google?

Update 2: I have updated the PHP content negotiation script to send a Vary header. Not sure how to do that with ASP, so I’d appreciate it if somebody could fill me in. I also added a link to the more advanced script at Keystone Websites.

Update 3: I have contacted Google AdSense tech support about the document.write + XHTML problem, and was informed that their engineers are currently working on a solution to the problem.

Update 4: Simon Jessey has come up with a workaround which is looking pretty good. An explanation can be found in Making AdSense work with XHTML.

Posted on August 8, 2004 in (X)HTML

Comments

  1. In Firefox you can right click the Google ad and select: This Frame > Open Frame in New Tab. You can then validate that page, which would tell you if that’s the problem.

    I’ve just done this, and it fails to validate. But should this stop the main page that calls the iframe from validating?

  2. Note that the document you are pointing to is in fact a W3C NOTE and not normative in any way (non-normative would be the appropriate term; you should not quote from it without mentioning that). According to RFC 3023 IIRC ‘application/xhtml+xml’ MUST be used.

  3. xHTML documents follow a different standard when using javascript. You will find that document.write statements will not work whatsoever.

    I’m not sure what the alternative is because I stopped dabbling with ECMAScript a while back because I prefer server-side scripting, so I’m rusty at it nowadays.

  4. August 9, 2004 by Roger (Author comment)

    Anne: Here are two quotes from the document I linked to:

    The ‘application/xhtml+xml’ media type [RFC3236] is the media type for XHTML Family document types, and in particular it is suitable for XHTML Host Language document types.

    and

    ‘application/xhtml+xml’ SHOULD be used for serving XHTML documents to XHTML user agents. Authors who wish to support both XHTML and HTML user agents MAY utilize content negotiation by serving HTML documents as ‘text/html’ and XHTML documents as ‘application/xhtml+xml’.

    According to that, sending XHTML documents as ‘text/html’ to HTML user agents is not OK, which I’m guessing is your standpoint. Other W3C documents state that it is OK if you follow their HTML Compatibility Guidelines.

    Like I said, there are different opinions on this.

    I skimmed through RFC 3023, but can’t find any mention of ‘application/xhtml+xml’. Could you provide some more details?

  5. August 9, 2004 by Roger (Author comment)

    Shunuk: That explains it. Looks like you have to choose between AdSense and application/xhtml+xml :(.

  6. I guess you’ll be interested in seeing Why document.write() doesn’t work in XML then…

  7. Hi, Roger. If you are going to be doing any kind of content negotiation, you really should say so in your header, with Vary: Accept. Furthermore, your current method of checking the Accept headers does not take into account the q-weighting given to each type. For more information, see Serving up XHTML with the correct MIME type - it describes a PHP solution that covers all bases that I’m aware of.

  8. August 9, 2004 by Roger (Author comment)

    Lolly: Thanks. That’s an excellent explanation of the problem. Does anyone know of a workaround that could be suggested to Google?

    Simon: I’ve seen that tutorial before, but it seemed a bit overkill to me at the time. I may reconsider that ;) The part about Vary: Accept is something I had missed though. I haven’t seen it in other articles about content negotiation. What exactly does it do?

  9. Roger, please note that that document is just a NOTE (I feel like I said this before) and it can’t represent the things you ought to do, since it isn’t a standard (thank god). Since it isn’t a standard you may not quote from it and say this is what you should do; that is just wrong.

    The Accept header can be used to say which type the client would prefer. Some people don’t want XHTML at all and use ‘application/xhtml+xml;q=0’, which means that they don’t want XHTML at any cost. Your scripts, however, will give it to them anyway and the page is borked.

    RFC 3236 doesn’t say anything either. It just defines the new media type, that’s all. Someone else addressed the Google Ads XHTML problem before, but with no luck. If they would use the DOM all would be fine, but no.

  10. The part about Vary: Accept is something I had missed though. I haven’t seen it in other articles about content negotiation. What exactly does it do?

    Basically, the Vary header informs the client (or an intermediate cache) that the content varies according to the client’s request - in this case, the contents of its Accept header. This is especially useful in caching situations, but otherwise it is just good manners. It doesn’t have to have it, but then you don’t have to have application/xhtml+xml either, so it seems appropriate to have one if you are going to have the other.

    The q-weighting thing probably is overkill, but I thought it was necessary to be thorough. It will probably come into its own with mobile devices and other esoteric user agents.

  11. August 9, 2004 by Roger (Author comment)

    Anne: I added a note about the note being a note. I actually failed to note that it was a note until now ;)

    Thanks for the link to Google AdSense and XHTML1.1. I’ll contact Google about it as well.

    Simon: Thanks for the explanation.

  12. Well, I’m in the same hole as well so I said “aww F**k it, I’m doing it myself.” Obviously Google’s gonna get pissed at me, but I’m rewriting their engine so it works with ALL websites (= more money for them). I’m writing the DOM code (with little JavaScript or XML knowledge, but enough to find out and learn!), and if anyone wants to help, email me!

  13. I’ve just figured out how to do the AdSense thing, but it isn’t very elegant. See AdSense for XHTML for details.

  14. August 22, 2004 by Roger (Author comment)

    That’s a nice trick, Simon. After reading your post I experimented a little, and found that you can use PHP to send the object element only to the same browsers that get application/xhtml+xml, and the normal JavaScript AdSense code only to text/html browsers. Seems like it works, and that removes the need for CSS tricks to hide one set of ads in some browsers.

    I wonder if the AdSense code likes being inserted via an object element though. Well, I guess one way of finding out is to try it :)

  15. A slight problem with your content negotiation script and Safari (Jaguar). If I add a normal link to certain websites and use Safari’s back-button to, uh, get back then the linking page, which is written in php/xhtml, breaks up! Safari says that there is a parse error. I haven’t yet figured out why, how or what happens.

  16. Correction to my earlier comment:

    “If I add a normal link to certain websites and use Safari’s back-button to..”

    Should be:

    “If I add a normal link to certain website and click the link in Safari and use Safari’s back-button to…”

    But I guess you caught it already. :)

  17. August 22, 2004 by Roger (Author comment)

    That Safari thing is not something I have noticed or heard of. It sounds very strange to me.

  18. Tell me about it! I tried out the Keystone Websites’ content negotiation script and everything is fine and dandy. I’m just guessing, but maybe there was something completely wrong with the other site I linked and Safari just freaked out or something? Or maybe my code is crappy? Questions, questions…

  19. Just a quick note that it’s starting to look like the problem may lie between Safari and the WordPress install I’m using now, NOT your script. Sorry for the inconvenience. :)

  20. This is a fairly common problem, and one that won’t go away any time soon.

    It’s not helped by the HTTP_ACCEPT specs being complicated and not properly followed by UAs.

    In fact, an HTTP_ACCEPT header can contain “application/xhtml+xml”, but be followed by “;q=0.01”, as you explained. As the alternative is a “text/” type as opposed to an “application/” type, this makes it very hard, almost impossible, to judge what the user agent would rather receive with any degree of accuracy.

    Also, a user agent can, according to the specs, send alternative HTTP_ACCEPT headers, like “application/*” (meaning that it could accept “application/xhtml+xml” without a problem), or even “*/*” (as IE does because MS are too lazy to do things properly), which indicates any MIME type is OK.

    And unfortunately, the above means that:

    1. Many user agents that can receive and handle “application/xhtml+xml” properly will be served “text/html”.

    2. Many user agents that can just about receive and comprehend “application/xhtml+xml” but would prefer “text/html” are served “application/xhtml+xml”.

    3. The most-used user agent (IE) lies about what it can and cannot receive and process properly.

    Which makes serving the correct mime types tricky at best. Personally, I’m working on a new version of my site. I use something similar to the above, with an IE-sniffer (because IE doesn’t send proper headers), and a small amount of “q” processing, but only very basic.

    I’m starting to think “text/plain” is the way to go…

  21. November 12, 2004 by matthijs

    Hi, just a question about this application versus text thing: why is it such a big problem to serve XHTML as text? I know you probably don’t want to repeat the discussion here, but maybe a link to a place where this problem is discussed and explained would be helpful for someone like me who doesn’t understand it! For what it’s worth, I just picked up the Zeldman book about making websites (DWWS) and learned that using the XML prolog causes a lot of problems. With the current browsers, anyway.

    Thanx for a reply, Matthijs, Holland

  22. November 18, 2004 by Steffan

    Without going so far as to worry about the possibility that UAs might use alternative HTTP_ACCEPT headers the following might prove useful:

    Dim objServerVariableRegExp, arrHttpAccept, i, lngHttpAcceptQuality(3)
    Set objServerVariableRegExp = New RegExp
    objServerVariableRegExp.Global = True
    objServerVariableRegExp.Pattern = "^(.*?)((;q=)(.*))?$"

    arrHttpAccept = Split(Request.ServerVariables("HTTP_ACCEPT"), ",")

    For i = 0 To UBound(arrHttpAccept)
        lngHttpAcceptQuality(0) = objServerVariableRegExp.Replace(arrHttpAccept(i), "$4")
        If lngHttpAcceptQuality(0) = "" Then
            lngHttpAcceptQuality(0) = 1
        Else
            lngHttpAcceptQuality(0) = CSng(lngHttpAcceptQuality(0))
        End If
        Select Case objServerVariableRegExp.Replace(arrHttpAccept(i), "$1")
            Case "application/xhtml+xml"
                lngHttpAcceptQuality(1) = lngHttpAcceptQuality(0)
            Case "text/html"
                lngHttpAcceptQuality(2) = lngHttpAcceptQuality(0)
            Case "*/*"
                lngHttpAcceptQuality(3) = lngHttpAcceptQuality(0)
        End Select
    Next

    If (lngHttpAcceptQuality(1) > lngHttpAcceptQuality(2) And lngHttpAcceptQuality(1) > lngHttpAcceptQuality(3)) Or InStr(Request.ServerVariables("HTTP_USER_AGENT"), "W3C_Validator") > 0 Then
        Response.ContentType = "application/xhtml+xml"
        Response.Write "" & vbCrLf
    Else
        Response.ContentType = "text/html"
    End If

    Response.Charset = "iso-8859-1"
    Response.AddHeader "Vary", "Accept"

    Hope you either find this useful or can point out any flaws in my approach.

  23. November 18, 2004 by Roger Johansson (Author comment)

    Matthijs: As you may have noticed, opinions vary on whether it actually is a problem or not. I’m hesitant to take a stand either way, so here are a couple of places to read more:

    Steffan: I haven’t tried the script but it looks interesting. Any ideas on how you could use ASP to convert XHTML to HTML 4 on the fly, similar to the PHP script described by Tommy Olsson in Content Negotiation?

  24. November 19, 2004 by Steffan

    Roger: I don’t think you can actually edit the response buffer in asp but the following will work:

    Having set Response.ContentType, you can build IF or SELECT statements to determine which content type is in force whenever you have blocks to handle differently (e.g. commenting with embedded css).

    In order to strip the forward slashes from empty elements you could load everything into a variable as you go rather than using response.write (i.e. strHTMLstream = strHTMLStream & “…”) then use a regex.replace on your variable before using response.write to pass it to the UA.

    This can apparently have a negative impact on performance though [although I haven’t tested it myself], as outlined in An Efficient String Concatenation Component.

  25. December 13, 2004 by Steffan

    Just noticed that part of the script didn’t display - the last part should have read:

    If (lngHttpAcceptQuality(1) > lngHttpAcceptQuality(2) And lngHttpAcceptQuality(1) > lngHttpAcceptQuality(3)) Or InStr(Request.ServerVariables("HTTP_USER_AGENT"), "W3C_Validator") > 0 Then
        Response.ContentType = "application/xhtml+xml"
        Response.Write "<?xml version=""1.0"" encoding=""iso-8859-1""?>" & vbCrLf
    Else
        Response.ContentType = "text/html"
    End If

    Response.Charset = "iso-8859-1"
    Response.AddHeader "Vary", "Accept"

    With regard to concatenating the page - I haven’t stripped forward slashes in this way (yet), but I have previously processed a number of large blocks to generate titles and links for words in the glossary on a site without any major slowdown.

    When I get a chance I’ll give this a try and speed test some static markup against server side generated stuff to see how bad the problem is on a reasonable length page.

Comments are disabled for this post (read why), but if you have spotted an error or have additional info that you think should be in this post, feel free to contact me.