The perils of using XHTML properly

I've been using XHTML for a couple of years now, but it wasn't until last summer that I started looking at using it properly, that is, by serving it with the application/xhtml+xml MIME type. I knew about some of the problems I was going to run into, but far from all of them. As you're about to find out, there are plenty of seemingly small issues that can make life difficult when you start using real XHTML.

Please note that this article is not an argument for or against the use of XHTML. I'm just documenting the potential pitfalls that I'm aware of, and will leave it up to you to decide what you want to use: HTML 4.01, XHTML 1.0 served as text/html to all browsers, or XHTML 1.0 served as application/xhtml+xml to browsers that handle it and as text/html to others. Or maybe something completely different.

I became aware of the gotchas one after the other, as I encountered the situations where they can occur. In some cases I had to spend quite some time looking for info and asking for help before finding a solution. But I learned from it, and I'm going to let you know what I would have liked to know when I started using XHTML.

Note that the issues I mention here only occur in user agents that properly handle the application/xhtml+xml MIME type, and therefore treat XHTML as XML. That is probably the major reason that these issues were not mentioned a lot in the early days of XHTML – very few people were using such web browsers, so almost nobody bothered to serve XHTML as anything but text/html.

Today, actually serving XHTML as application/xhtml+xml is becoming slightly more common. There are two reasons, as I see it:

  1. The number of people using Firefox, Mozilla, Opera, Safari, and other XHTML capable browsers has increased a lot, so you're not doing it just for yourself and your fellow geeks. Well, maybe you are, but it will affect many more.
  2. There is an increased awareness of what XHTML actually is among web developers. There have been several, sometimes heated, debates on the use of XHTML, especially when served as text/html. If you've taken part in any such discussions, you know what I mean.

If you, like me, decide to implement some kind of content negotiation and use the correct media type to deliver XHTML, you need to know what can (and will) happen to the documents you publish, and how to avoid problems. For some interesting reading on the subject of content negotiation, as well as examples of scripts that will perform the content negotiation, I'd like to refer you to "Content Negotiation" and "Serving up XHTML with the correct MIME type". There are more articles of that kind, but those two are among the best I've read on the subject.
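
To make the idea of content negotiation concrete, here is a minimal sketch in JavaScript. It is not taken from either of the articles mentioned; the function names parseAccept and chooseMediaType are my own, and the rule it applies (send XML only when the browser explicitly lists application/xhtml+xml with a q value above zero and at least as high as text/html) is just one reasonable policy.

```javascript
// Parse an HTTP Accept header into a Map of media type -> q value.
function parseAccept(header) {
  const prefs = new Map();
  for (const part of (header || "").split(",")) {
    const [type, ...params] = part.split(";").map((s) => s.trim());
    if (!type) continue;
    let q = 1; // q defaults to 1 when the parameter is absent
    for (const p of params) {
      const m = /^q\s*=\s*([0-9.]+)$/i.exec(p);
      if (m) q = parseFloat(m[1]);
    }
    prefs.set(type.toLowerCase(), q);
  }
  return prefs;
}

// Decide which media type to send for an XHTML document.
function chooseMediaType(acceptHeader) {
  const prefs = parseAccept(acceptHeader);
  const xhtmlQ = prefs.get("application/xhtml+xml");
  // A header that only says "*/*" does not count as XHTML support here.
  if (xhtmlQ === undefined || xhtmlQ <= 0) return "text/html";
  const htmlQ = prefs.has("text/html") ? prefs.get("text/html") : 0;
  return xhtmlQ >= htmlQ ? "application/xhtml+xml" : "text/html";
}
```

A server-side script would call chooseMediaType with the request's Accept header and use the result as the Content-Type of the response.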

Some of the more obvious differences between HTML and XHTML are listed in every basic XHTML tutorial: use lowercase for element and attribute names, always quote attribute values, don't use attribute minimisation, make sure all elements have end tags and that no elements are incorrectly nested, etc. However, there is more to be aware of when XHTML is being served as application/xhtml+xml.

Well-formedness is required

Documents must be well-formed XML (which is not necessarily the same as valid XHTML). No compromises, no room for error. If documents aren't well-formed, conforming browsers (currently I'm aware of Mozilla, Firefox, Netscape, Camino, Opera, Safari, and OmniWeb – pretty much any browser but Internet Explorer) will display an error message and abort rendering the document in one way or another.

Among other things, this means no more unencoded ampersands.
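
As an illustration, the following sketch escapes bare ampersands in a piece of text or a URL before it goes into a document. The function name escapeBareAmpersands is made up for this example, and the pattern is a heuristic: anything that already looks like an entity or character reference is left alone, even if it was not meant as one.

```javascript
// Replace every "&" that does not start something shaped like an
// entity or character reference (&name; &#123; &#x1F;) with "&amp;".
function escapeBareAmpersands(text) {
  return text.replace(
    /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/g,
    "&amp;"
  );
}
```

The typical culprit is a query string such as news.php?id=1&page=2 in an href attribute, which has to be written as news.php?id=1&amp;page=2.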

The XML declaration may be required

If you use any other character encoding than UTF-8 or UTF-16, the XML declaration is required unless the encoding is provided by the HTTP header.

Whether or not character encoding should be specified in the HTTP headers is slightly unclear. Architecture of the World Wide Web, Volume One: Media Types for XML states that

"In general, a representation provider SHOULD NOT specify the character encoding for XML data in protocol headers since the data is self-describing."

On the other hand, here's what XHTML 1.0, Second Edition: Character Encoding says:

"In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers."

Either way, it's good practice to specify the character encoding in the XML declaration:

<?xml version="1.0" encoding="iso-8859-1"?>
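
The rule at the start of this section can be written down as a small check. This is only a sketch of the rule as stated above; the function names and the httpCharset parameter (the charset from the HTTP Content-Type header, or null if the server sends none) are assumptions made for the example.

```javascript
// True when the XML declaration is required: the encoding is neither
// UTF-8 nor UTF-16, and the HTTP header does not provide a charset.
function needsXmlDeclaration(documentEncoding, httpCharset) {
  const enc = documentEncoding.trim().toLowerCase();
  const isDefaultEncoding = enc === "utf-8" || enc === "utf-16";
  const declaredByHttp =
    typeof httpCharset === "string" && httpCharset.trim() !== "";
  return !isDefaultEncoding && !declaredByHttp;
}

// Build the declaration for a given encoding.
function xmlDeclaration(encoding) {
  return `<?xml version="1.0" encoding="${encoding}"?>`;
}
```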

Only five named entities are safe

Only the five predefined named entities (&lt;, &gt;, &amp;, &quot;, and &apos;) are guaranteed to be supported. Others may be completely omitted or output literally. For example, if your XHTML document contains entities like &nbsp; or &rdquo;, that is what Safari will render. Literally. Opera instead omits the unknown entities. The Mozilla family will recognise the entities and render them as in HTML, but only if the document references a public identifier for which there is a mapping in the browser's pseudo-DTD catalog and the document has not been declared standalone.

Using the UTF-8 character encoding, which is the recommended best practice, lets you use (almost) any characters you like by typing them into your document, without the need for entities or character references. If you can't or won't use UTF-8, numeric character references are supported and safe to use.
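
If you have existing content full of named entities, converting them to numeric character references can be automated. Here is a minimal sketch; the function name toNumericReferences is my own, and the table only covers a handful of entities as an example, so a real script would need the full list from the XHTML DTDs.

```javascript
// A small sample of HTML named entities and their Unicode code points.
const ENTITY_CODE_POINTS = new Map([
  ["nbsp", 160],
  ["copy", 169],
  ["ndash", 8211],
  ["mdash", 8212],
  ["ldquo", 8220],
  ["rdquo", 8221],
]);

// The five entities predefined by XML are safe and left untouched.
const PREDEFINED = new Set(["lt", "gt", "amp", "quot", "apos"]);

// Replace known named entities with numeric character references.
// Names missing from the table are left as they are.
function toNumericReferences(markup) {
  return markup.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (PREDEFINED.has(name)) return match;
    const codePoint = ENTITY_CODE_POINTS.get(name);
    return codePoint === undefined ? match : `&#${codePoint};`;
  });
}
```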

The contents of SGML comments may be discarded

SGML comments (HTML-style comments, <!-- comment -->) may be (and are) treated as comments by browsers, even when used inside script or style elements.

In HTML, it is common to enclose the contents of script and style blocks in comments to hide them from browsers that do not recognise script or style elements, and would render their contents as plain text on the page.

In XHTML, doing so will cause browsers to ignore anything inside the comment.

The practice of hiding the contents of script and style elements from old browsers is a habit from way back in the mid-nineties. In my experience, browsers that behave this way are so rare that you can safely ignore them, and stop enclosing scripts and style sheets in SGML comments, even if you use HTML.

Contents of script and style elements are treated as XML

The style and script elements are PCDATA (parsed character data) blocks, not CDATA (character data) blocks. Because of this, anything in them that looks like XML will be parsed as XML, and cause an error unless it is well-formed.

In order to use <, &, or -- in a style or script block, you need to wrap its content in a CDATA section:

  <script type="text/javascript">
  <![CDATA[
  ...
  ]]>
  </script>

Inside a CDATA section, you can use any sequence of characters without it being parsed as XML (except ]]>, which ends the CDATA section).

For documents to safely be sent as text/html when necessary, the opening and closing tags of the CDATA section need to be commented out to hide them from browsers that don't handle CDATA sections:

  <script type="text/javascript">
  // <![CDATA[
  ...
  // ]]>
  </script>

  <style type="text/css">
  /* <![CDATA[ */
  ...
  /* ]]> */
  </style>

If you want to make sure that really old browsers don't see the contents of a CDATA section, you need to use a more complicated method, as described by Ian Hickson in Sending XHTML as text/html Considered Harmful:

  <script type="text/javascript">
  <!--//--><![CDATA[//><!--
  ...
  //--><!]]>
  </script>

  <style type="text/css">
  <!--/*--><![CDATA[/*><!--*/
  ...
  /*]]>*/-->
  </style>

An even better solution would be to let your content negotiation script remove any CDATA sections before serving the document as text/html.
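
Removing the CDATA sections could look something like the sketch below. The function name stripCdataSections is made up, and the sketch assumes that CDATA sections are only used inside script and style elements; a CDATA section in ordinary content would need its text escaped instead of just unwrapped.

```javascript
// Remove the <![CDATA[ and ]]> markers but keep what is between them.
// The non-greedy match stops at the first "]]>", which is where the
// CDATA section ends in XML as well.
function stripCdataSections(markup) {
  return markup.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g, "$1");
}
```

With the commented-out form of the markers, what is left behind is an empty JavaScript or CSS comment, which is harmless in text/html.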

A sidenote: I've seen Opera have problems with commented CDATA sections in XHTML. When a commented CDATA section is present within a style element, Opera (tested in 7.54 Mac) ignores the first stylesheet rule and any @import rules in the entire style element. Anyone know if this is a bug in Opera or if the behaviour can be explained by something else?

Of course, the cleanest and safest way is to move all CSS and JavaScript to external files. That's not always practical though.

No elements are inferred

In HTML, a table's tbody element will be inferred by the browser if it's missing from the markup. Not so in XHTML. If you don't explicitly add tbody, it doesn't exist. Keep this in mind when writing CSS selectors and JavaScript.
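
On the JavaScript side, one way to avoid depending on an inferred tbody is to look up rows by element name instead of walking from table to tbody to tr. A minimal sketch, with a function name of my own choosing:

```javascript
// Collect a table's rows without assuming a tbody element exists.
// getElementsByTagName searches all descendants, so it finds tr elements
// whether they sit directly under table or inside thead, tbody, or tfoot.
// Note that rows of nested tables are included as well.
function getTableRows(table) {
  return Array.from(table.getElementsByTagName("tr"));
}
```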

Scripting with document.write doesn't work

When JavaScript is used with XHTML, document.write() does not work. Ian Hickson explains why in Why document.write() doesn't work in XML. You need to use document.createElementNS() instead. More info on that can be found in a forum thread at Experts Exchange.
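
For example, where you might have written a paragraph with document.write() in HTML, a DOM-based version could look like the sketch below. The helper name appendParagraph and the "content" id in the usage note are made up for the example; createElementNS, createTextNode, and appendChild are standard DOM methods.

```javascript
// The XHTML namespace, as used in the xmlns attribute of the html element.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Create a p element in the XHTML namespace, put the text inside it,
// and append it to the given parent node.
function appendParagraph(doc, parent, text) {
  const paragraph = doc.createElementNS(XHTML_NS, "p");
  paragraph.appendChild(doc.createTextNode(text));
  parent.appendChild(paragraph);
  return paragraph;
}
```

In a page you would call it as appendParagraph(document, document.getElementById("content"), "Hello"), assuming an element with that id exists.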

This is one of the reasons that Google AdSense doesn't work with XHTML. For those who wish to serve XHTML as application/xhtml+xml and have Google ads, there is a workaround described by Simon Jessey in Making AdSense work with XHTML. A bit messy, but it works (I'm using it here), and is approved by Google.

Referencing style elements

In XHTML, to be compatible with the XML method for defining CSS rules, you should use an XML stylesheet declaration to load an external CSS file. (XHTML 1.0, Second Edition: Referencing Style Elements when serving as XML calls it an XML stylesheet declaration, while Associating Style Sheets with XML documents calls it an xml-stylesheet processing instruction.) When using a style element, you should also use an XML stylesheet declaration to reference the style element. To do this, use the id attribute for the style element to give it a fragment identifier, and then reference that in the XML stylesheet declaration:

  <?xml-stylesheet href="stylesheet1.css" type="text/css"?>
  <?xml-stylesheet href="#stylesheet2" type="text/css"?>
  <!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
  <title>XML stylesheet declaration</title>
  <style type="text/css" id="stylesheet2">
  @import "stylesheet2.css";
  </style>
  </head>

I'm not sure how much of a requirement this actually is, and if there are any problems associated with not using XML stylesheet declarations. Maybe someone can enlighten me.

CSS is applied slightly differently

CSS properties applied to the body element don't apply to the whole viewport in XHTML. This is most notable when a background colour or image is applied. In HTML, a background applied to the body element will cover the entire page. In XHTML, you need to style the html element as well. There is a demonstration of this behaviour in CSS body Element Test at Juicy Studio.

Element and attribute names in CSS rules are case sensitive in XHTML (and must be lowercase). It's very simple to avoid problems with case-sensitivity: just make sure all element names, attribute names, and selectors are lowercase, whether you're writing HTML, XHTML, or CSS.

Challenging, but not impossible

When I decided to start serving XHTML as application/xhtml+xml to capable browsers, it would have saved me some headaches to have had an article like this to read before making that decision. I might even have considered using HTML 4.01 Strict instead. Nevertheless, I've learned from the experience, and learning is always good.

Hopefully this has provided you with a bit more information about what it actually means to use XHTML properly, and you can make a slightly more informed decision as to whether you want to go that way or not.

There are probably even more differences between HTML and XHTML than those I've mentioned here, so feel free to add any additional pitfalls you've encountered when serving XHTML as application/xhtml+xml. If you can spot any errors or omissions, do tell.

Update: Some clarifications and links to references added.

Posted on January 18, 2005 in Web Standards, JavaScript, CSS, (X)HTML