Content negotiation, AdSense, and comments

A few weeks ago I posted about my experiences with content negotiation, and some roadblocks I ran into when looking to serve XHTML documents as ‘application/xhtml+xml’ to capable browsers.

The problems were related to the way Google ads are displayed, and the fact that it’s very hard to make sure people use valid (or at least well-formed) XHTML in comments. Thanks to some clever people, I have found workarounds for both of these problems.

Content negotiation scripts

In my previous post on this subject, I included examples of content negotiation scripts for PHP and ASP. As Simon Jessey pointed out, the scripts did not include a Vary header. That has been fixed in the PHP version. Note that the script I included is somewhat simplified; a more advanced script can be found in Serving up XHTML with the correct MIME type.
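
To give an idea of the shape of such a script, here is a bare-bones PHP sketch (it ignores q values, so treat it as an illustration rather than the exact script from the earlier post):

    <?php
    // Bare-bones content negotiation sketch (ignores q values). Browsers
    // whose Accept header mentions application/xhtml+xml get that MIME
    // type; everything else, including IE, gets text/html.
    $accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

    if (stristr($accept, 'application/xhtml+xml')) {
        header('Content-Type: application/xhtml+xml; charset=iso-8859-1');
    } else {
        header('Content-Type: text/html; charset=iso-8859-1');
    }

    // Tell caches that the response depends on the Accept header.
    header('Vary: Accept');
    ?>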

However, I have not been able to find any information on how to send a Vary header with ASP (Classic). If there are any ASP experts reading this, perhaps you can help me with that.

Google AdSense

First, the problems with Google AdSense. Google’s ad code uses the JavaScript function document.write() to generate the HTML for the ads. When XHTML is served as ‘application/xhtml+xml’, document.write() doesn’t work, as Ian Hickson explains in Why document.write() doesn’t work in XML. Even if it did work, the ads are embedded in an iframe, which is not allowed in XHTML 1.0 Strict, my preferred flavour.

Simon Jessey came up with a workaround, described in detail in Making AdSense work with XHTML. In short, you create a separate HTML 4.01 document containing the ad code and embed it, using an object element, in the document where the ads are to be displayed.
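
As a rough illustration of the idea (the file name and dimensions are made up for this example), the embedding markup might look something like this:

    <object type="text/html" data="/adsense/ad.html"
            width="468" height="60">
      Advertisement
    </object>

The document at /adsense/ad.html would then be a plain HTML 4.01 page containing nothing but Google’s standard ad code, so document.write() and the iframe stay outside the XHTML document.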

As a precaution, before implementing this method I contacted Google to check if they would have a problem with it. I didn’t want to risk my AdSense account being cancelled. After a day or so I got a reply from The Google Team, stating that they had reviewed Simon’s article and found the method acceptable. So, now there is a way to have Google ads on a page served as ‘application/xhtml+xml’.

Keeping comments valid

The second roadblock was comments. Comments are great, and an important part of any site of this kind. However, allowing HTML in comments makes it possible for invalid or non-well-formed markup to sneak into documents. Once you serve a document as ‘application/xhtml+xml’, it must be well-formed, or strictly conforming browsers will throw an error instead of displaying the page.

One possibility is to disallow HTML in comments, something I tried and didn’t like. I want people who take the time to comment to be able to add things like quotes, links, and emphasis to their comments. I looked at ways to validate comments, but couldn’t figure out how to implement them. Then I found Markdown, a Movable Type plugin by John Gruber.

Markdown uses a simple text formatting syntax, which gets converted to XHTML when the document is rebuilt. That way, comments can contain links, emphasis, blockquotes, lists, and more, even though direct HTML input is not allowed. The drawback is that anyone leaving comments will need to know some Markdown syntax unless their comment is just plain text. It’s very simple though, and I’ve added some basic instructions to the comment form. I’m giving it a try.
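
To give an idea of the syntax, a comment written like this (just a small sample, not the full syntax):

    This is *emphasis*, this is a [link](http://example.com/), and:

    > This is a blockquote.

comes out as em, a and blockquote elements in the generated XHTML.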

Serving it up

Finding these workarounds has allowed me to use content negotiation to serve my documents as ‘application/xhtml+xml’ to browsers that can handle it, and as ‘text/html’ to all others. I’d like to keep it that way, so I’m hoping that the folks at Google don’t change their minds, and that all comments stay valid.

Update: It looks like the problems with Safari mentioned in the comments to this post aren’t actually problems at all. Safari seems to be interpreting XHTML sent as ‘application/xhtml+xml’ very strictly, and does not recognise any named character entities except for &amp;, &lt;, &gt;, &apos;, and &quot;.

Any others are not displayed as the expected character - they are just written to the page with no conversion.

Posted on September 1, 2004 in (X)HTML, Search Engine Optimisation, Web Standards

Comments

  1. Hmmm. This Markdown stuff is interesting.

    Not sure I understand the line break thing.

    Ah well.

  2. Forgot to say: thanks for the hints! I’m sure lots of bloggers will find this useful. Myself, I can’t be bothered with XHTML, but it’s very nice to see someone working on practical solutions to make it into an option.

  3. Maybe we should suggest that Google adopt a more standards-compliant approach to their AdSense code.

  4. What many sites seem to do wrong is that they only use content negotiation to vary the MIME type, not the actual content.

    What I do is this: if the user agent prefers XHTML, it gets XHTML 1.1 as application/xhtml+xml. Otherwise, it gets HTML 4.01 Strict as text/html.

    Basically, the whole thing is quite silly. If you can serve it as text/html, you don’t use any advantages of X(HT)ML, so you might just go ahead and mark it up as HTML in the first place. Having said that, I tell myself that there is an advantage in it for me anyway. I use XHTML markup and convert it to HTML automatically, when needed, through a simple PHP function. Thus, on that glorious day when IE is either dead or has learned to handle application/xhtml+xml, I can just remove the content negotiation script and start using XHTML more seriously. :)
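
    A bare-bones sketch of that kind of conversion (an illustration of the approach, not the commenter’s actual function) could use an output buffer callback:

        <?php
        // Illustration only: rewrite buffered XHTML as HTML when the page
        // is being served as text/html.
        function xhtml_to_html($buffer)
        {
            // Self-closing elements lose the trailing slash: <br /> -> <br>
            $buffer = preg_replace('/\s*\/>/', '>', $buffer);
            // xml:lang attributes become plain lang attributes.
            $buffer = preg_replace('/xml:lang="([^"]*)"/', 'lang="$1"', $buffer);
            return $buffer;
        }

        // Registered only when text/html has been chosen:
        // ob_start('xhtml_to_html');
        ?>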

  5. Are you removing all the characters from iso-8859-1 that are not allowed per the XML specification? Otherwise your site might stop being well-formed within seconds.

  6. September 1, 2004 by Roger Johansson (Author comment)

    Yes, the next step is to convert XHTML to HTML before sending it to IE and other browsers that don’t want XHTML.

    Anne: are you referring to &amp;, &lt;, &gt;, &apos; and &quot;? Anyway, I am going to switch to UTF-8. I just need to make sure I don’t break anything by doing so.

  7. The whole point of my content negotiation script is that it does convert XHTML into HTML. I agree that it is a largely pointless exercise, but I saw it as the most elegant way of preparing for the time when IE will accept the XHTML properly. Unfortunately, I am still having problems with the RegEx for the script.

    I’m glad to hear that Google approves of my AdSense method. When you think about it, I am not actually messing with their script at all. I am just changing the way it is served slightly.

  8. September 1, 2004 by Roger Johansson (Author comment)

    Simon: I was referring to the script I’m using, which is pretty similar to yours, but does not convert XHTML to HTML. What kind of problems are you having btw?

    Regarding AdSense, I was concerned that their code would get confused by being in an iframe in an HTML document in an object element in an XHTML document ;p

  9. The regular expression doesn’t quite work right, although it is pretty close. Basically, it is supposed to be able to identify and compare numbers from 0 to 1 inclusive, up to 3 decimal places of accuracy. For example, it may need to compare 0.998 with 0.999, although I concede this is highly unlikely! I’m not really all that familiar with RegEx, so I’m having a spot of difficulty with it. I blogged about it earlier, but apparently it still isn’t right.
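
    For what it’s worth, one way to avoid comparing the numbers as strings is to pull the q values out with a regular expression and compare them as floats - a rough sketch of the idea, not the script in question:

        <?php
        // Rough sketch: extract the q value for a MIME type from an Accept
        // header and compare it as a number rather than as a string.
        function q_value($accept, $type)
        {
            // Matches q values from 0 to 1 with up to 3 decimals, e.g. q=0.998
            $pattern = '/' . preg_quote($type, '/') .
                       '\s*;\s*q\s*=\s*(0(\.\d{1,3})?|1(\.0{1,3})?)/i';
            if (preg_match($pattern, $accept, $match)) {
                return (float) $match[1];
            }
            // A listed type without an explicit q value defaults to q=1.
            return stristr($accept, $type) ? 1.0 : 0.0;
        }

        // Serve XHTML only if its q value is at least as high as HTML's:
        // if (q_value($accept, 'application/xhtml+xml') >= q_value($accept, 'text/html')) ...
        ?>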

    By the way, any chance of a “remember me” cookie?

  10. Simon, my content negotiation script is remarkably similar to yours, although I did a simpler regex detection of quality values.

    Your ‘fix_buffer’ function only handles self-closing elements. Mine also converts xml:lang attributes into lang attributes.

    Anne, I don’t know anything about this. I currently serve my pages with ISO-8859-1. Are you saying that this encoding can cause well-formedness errors? I know that I should use UTF-8, and I’m looking into that, but I have a hard time imagining a web browser in the near future that doesn’t support ISO-8859-1.

  11. September 1, 2004 by Roger Johansson (Author comment)

    Ok. I’ve updated the script I use for content negotiation to handle self-closing elements. Not implemented on all pages yet though.

    One thing that bugs me is that none of the scripts I have seen seem to detect that Safari handles ‘application/xhtml+xml’. I added a browser check for Safari to my script… yuck. Anyone know of a better way?

  12. September 1, 2004 by Roger Johansson (Author comment)

    Simon, I’ve added a “Remember me” cookie to my to-do list.

  13. Markdown looks damn good. I just wish it was in PHP rather than Perl. I’m fluent in both but I prefer the former.

  14. Safari seems to send this:
    Accept: */*

    When I force XHTML to Safari, it botches everything up, apparently by applying all style sheets - including the one with media="print"!

    As long as Safari doesn’t handle XHTML properly, and doesn’t specify application/xhtml+xml with a higher q value than for text/html, it will get HTML 4.01 Strict from me.

  15. September 2, 2004 by Roger Johansson (Author comment)

    Tommy: Yeah, I also noticed some problems with Safari, so I’m back to serving it HTML 4.01 Strict + ‘text/html’.

  16. September 2, 2004 by Eric TF Bat

    (Interesting… I got my comment edited. I wonder why! I’d put in some (ahem) humourous links (Church of Satan for Perl, the PBS Sesame Street site for PHP) so maybe I was too controversial. Hey ho; seems like a silly thing to do, risk offending a poster just for that. Still, it’s Roger’s site; he can do what he likes until I start paying his web hosting…)

    I found a link to a PHP port of Markdown, which I think might just solve my problems. We shall see.

  17. September 2, 2004 by Roger Johansson (Author comment)

    Eric: Hehe. Yup. I removed those slightly off topic links ;-)

  18. Anne’s talking about codepoints invalid in your character encoding. One of the easiest well-formedness constraints to break is that you must not have a byte (or series of bytes) that either isn’t defined in your character encoding, or is one of the control characters not allowed in XML. ISO-8859-1 is particularly vulnerable, because browsers will silently submit Windows-1252 instead: paste a Euro symbol into your comments, and you’ll submit the Win-1252 version, and your page will explode. Better drop Jacques’ StripControlCharacters plugin in, right quick.
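
    The principle behind that kind of plugin is simple enough; stripped down to PHP it amounts to something like this (a sketch of the idea, not the actual plugin):

        <?php
        // Sketch of the idea: XML 1.0 forbids the C0 control characters
        // except tab (0x09), line feed (0x0A) and carriage return (0x0D).
        function strip_xml_control_chars($text)
        {
            return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
        }
        ?>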

    Worse yet, mt-comments.cgi isn’t delivering application/xhtml+xml, so even someone trying to be good won’t know until they submit.

  19. September 2, 2004 by Robert

    A quick Google search (asp http header +vary) led me to Herong’s Notes on ASP at Geocities.

    I work with ASP, though I’ve never set HTTP headers with it. From the examples given, it looks like you would set the Vary header with the Response.AddHeader method, something like:

    Response.AddHeader "Vary", "Accept"

    You might give that a try and see if it properly sets the header for you.

  20. September 2, 2004 by Roger Johansson (Author comment)

    Phil: Thanks for explaining that, and for mentioning the StripControlCharacters plugin. Installed :) Is this something you need to watch out for regardless of which character encoding you use, or is UTF-8 safe?

    I know about mt-comments.cgi. I haven’t figured out how to work around that. Dave Shea has written something about it in PHP, CGI, and MT: Together at Last, but I haven’t taken a closer look at it.

    Robert: Weird. I spent quite some time googling to find info on that, and came up with nothing. Guess I didn’t use the right search terms. Anyway, thanks for the link. I’ll give it a try when I get to the office tomorrow.

  21. Worse yet, mt-comments.cgi isn’t delivering application/xhtml+xml, so even someone trying to be good won’t know until they submit.

    Naturally, Jacques Distler has that covered, too: Serve It Up! (And I can confirm that the changes to App.pm work just fine with MT 3.1, something for which I am grateful, because my MT 2.661 -> MT 3.1 upgrade has been, ah, tricky.)

  22. I’m surprised you are using Movable Type, to be honest. Your penchant for tinkering with it makes me think that WordPress would be more up your street.

    Come to think of it, I’m surprised you haven’t built your own CMS. I know that you are more than capable enough, and it is a great process to go through.

  23. UTF-8 isn’t safe, just safer. Any charset has characters which aren’t allowed in XML, and bytes which are undefined as characters in that charset. If you just use your browser’s “View - Character Encoding” menu to change to something other than the page’s actual encoding, you’ll then submit in that other charset, making a hash of things. (You can try to work around that a little bit by having a hidden form element with the name “charset”, which both IE and Mozilla (not sure about Opera or Safari) will fill in with the charset they used on what they submit, but that doesn’t stop someone from maliciously submitting with an incorrect value for charset.)

    UTF-8 is a little safer, in that there isn’t the institutionalized “ignore the charset, all ISO-8859-1 is Windows-1252” problem, but if you are going to accept content for an XML page, you have to verify that every character is defined in your charset, no matter which it is.
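
    As an illustration of that kind of check, a crude PHP test for whether submitted text is at least valid UTF-8 is to run it through a u-modifier regular expression, which fails on malformed byte sequences:

        <?php
        // Crude illustration: the 'u' modifier makes preg_match return
        // false for byte sequences that are not well-formed UTF-8.
        function is_valid_utf8($text)
        {
            return preg_match('//u', $text) === 1;
        }
        ?>

    That still does not catch characters which are valid UTF-8 but forbidden in XML, so something like the control character stripping mentioned above is needed as well.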

    As to MT vs. WP for tinkering, well, I’ve managed to stay entertained for more than two years tinkering with MT. WP’s more fun to play with, in some ways, but it also carries a lot of b2 baggage that could use refactoring. MT started out with clean OOP, so once you wrap your head around it there aren’t as many surprises.

  24. September 3, 2004 by Roger (Author comment)

    I’m surprised you are using Movable Type, to be honest.

    I haven’t had enough problems with it to take a really close look at other options. Yet.

    Come to think of it, I’m surprised you haven’t built your own CMS.

    I suppose I could, but it isn’t anything I have considered. I may change my mind about that some day.

    UTF-8 is a little safer, in that there isn’t the institutionalized “ignore the charset, all ISO-8859-1 is Windows-1252” problem, but if you are going to accept content for an XML page, you have to verify that every character is defined in your charset, no matter which it is.

    Ok. Thanks for clearing that up. This sounds like a perfect job for a plugin, like the StripControlCharacters plugin. Is there something similar for UTF-8?

  25. For now, I’m using Alexei’s UTF8Hack while I plug away at writing something better able to deal with the real source of invalid characters: Trackback.

  26. September 4, 2004 by Roger (Author comment)

    Well. I tried converting to UTF-8, and ran into problems, so I reverted to iso-8859-1. Any posts that contained accented or otherwise special characters were seriously messed up. Not sure why. I converted the whole database to UTF-8 (by exporting and reimporting into MT, which is very tedious), changed all templates and everything else, and edited mt.cfg to use UTF-8. Bah. I’m going to stay away from this for a few days now.

  27. September 22, 2004 by Roger Johansson (Author comment)

    In case anyone is still reading the comments here, I’d like to let you know that it looks like the problems with Safari mentioned in the above comments aren’t actually problems. Instead it seems Safari is interpreting XHTML sent as ‘application/xhtml+xml’ very strictly, and does not recognise any named character entities except for &amp;, &lt;, &gt;, &apos;, and &quot;.

    Any others are not displayed as the expected character - they are just written to the page with no conversion.

Comments are disabled for this post (read why), but if you have spotted an error or have additional info that you think should be in this post, feel free to contact me.