Character references in HTML and XML

To find out everything you need to know about numeric character references and character entity references in HTML and XML, read Lachlan Hunt’s article Character References Explained.

The conclusion is that if you use character entity references such as &ndash; in HTML documents, they work fine. If the same documents are then converted to XHTML and sent as application/xhtml+xml, there is no guarantee that the character entity references will continue to work.

Some browsers include a pseudo-DTD catalog that contains the character entity references, letting the browser display them as if the document were HTML. Other browsers will simply omit character entity references other than the five that are predefined in XML.

The safest option is to use numeric character references, such as &#8211; for an en dash.
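As a rough illustration of that advice, the following Python sketch rewrites named entity references as numeric ones. It is not taken from Lachlan Hunt's article; the function name `to_numeric_references` is made up for this example, and the lookup relies on the HTML 4 entity table in Python's standard `html.entities` module. The five entities predefined in XML and any unknown names are left untouched.

```python
import re
from html.entities import name2codepoint

# The five entities predefined in XML; these work without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches references such as &ndash; or &copy; (name followed by a semicolon).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Replace named HTML entity references with numeric character references.

    Hypothetical helper for illustration: predefined XML entities and
    names missing from the HTML 4 table are returned unchanged.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY_RE.sub(replace, markup)


if __name__ == "__main__":
    print(to_numeric_references("<p>2004&ndash;2005 &amp; &copy; &bogus;</p>"))
```

A simple regular expression like this does not know about CDATA sections, comments or script content, so it is only suitable for markup where those do not contain entity-like text.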

Posted on October 23, 2005 in Quicklinks

Comments

  1. October 23, 2005 by Poncho

    I recommend Character entities cheat sheet from ILoveJackDaniels

    Cheers; Poncho

  2. Instead of that, why not use pure UTF goodness?

  3. Thanks Roger, but stop pointing your finger at me … ;) We all learn something new every day and I’ll read the above-mentioned article thoroughly, promise. :D

  4. I think there are still way too many how-tos, tutorials etc. on the web that tell you to use &ndash; etc. instead of numeric references, although numeric references don’t seem to have any disadvantages.

  5. I would agree with Mini-d, UTF-8 or another encoding that supports the characters you want would be better.

    Some problems:

    - Homesite (if you like it as much as I do!) doesn’t support Unicode! I switched to jedit instead (www.jedit.org); not perfect, but highly customizable.
    - You might be stuck in a situation where either half your team uses something that saves your UTF files as ANSI (hence the question marks and blocks on the web pages!) or your CMS may not help you much…

    There are probably a good number of other problems, but where you are in control or have influence, selecting the right character encoding in the first place would seem best.

  6. October 24, 2005 by Roger Johansson (Author comment)

    Poncho: Handy chart, that.

    mini-d, Anup: Absolutely, using UTF-8 is better, but not always an option. And it is harder to make everything work properly when you use UTF-8 since application support is not all the way there yet.

    Mats: No fingers pointed your way ;-).

  7. Some browsers will treat them as if the document was HTML, while others will display parsing errors or simply omit them. The safest is to use numeric character references.

    I think you misunderstood something in my article. Browsers that support application/xhtml+xml will always (or at least should) treat the document as XHTML. However, because no browser uses a validating parser, they don’t read the external DTD.

    As a workaround, Mozilla has a pseudo-DTD catalog containing just the ENTITY declarations and includes it as if they were all declared within the internal subset (which is read by all XML parsers).

    For example, the following contains an entity declaration within the internal subset. “foo” is the entity name and “bar” is the replacement text.

    <!DOCTYPE html [
      <!ENTITY foo "bar">
    ]>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <p>&foo;</p>
    </html>

    (try that code in Firefox, it will actually work)

  8. October 25, 2005 by Roger Johansson (Author comment)

    Lachlan: That was basically what I meant, but I tried to simplify things a bit ;-). I updated that part and I hope it is correct now.

  9. My problem is that I’ve been spoilt by various XML-based toolchains like Docbook that let you define entities in internal subsets within the source document, and use those entities throughout the ensuing XML. I haven’t been able to get that to work properly in XHTML, which is a major bummer in my book; it means I’m essentially going to have to go back to a full-blown XSLT transformation just to be able to code replaceable entities. Look at the linked page (seven-sigma dot com slash foo dot html) to see a trivial example of what I’m talking about.
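The parser behavior discussed in comments 7 and 9 can be observed outside a browser. The sketch below assumes Python's standard `xml.etree.ElementTree`, which wraps the non-validating Expat parser; it stands in for a browser's XML parser and is not a statement about what any particular browser does. The constant and function names are invented for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Entity declared in the internal subset, as in comment 7.
WITH_SUBSET = (
    '<!DOCTYPE html [ <!ENTITY foo "bar"> ]>'
    '<html xmlns="http://www.w3.org/1999/xhtml"><p>&foo;</p></html>'
)

# HTML entity with no declaration anywhere in the document.
WITHOUT_DECLARATION = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><p>&ndash;</p></html>'
)

# Numeric character reference; needs no declaration at all.
NUMERIC = '<html xmlns="http://www.w3.org/1999/xhtml"><p>&#8211;</p></html>'


def paragraph_text(document: str) -> str:
    """Parse the document and return the text of its first XHTML <p>."""
    root = ET.fromstring(document)
    return root.find(f"{XHTML_NS}p").text


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(paragraph_text(WITH_SUBSET))
    print(is_well_formed(WITHOUT_DECLARATION))
    print(paragraph_text(NUMERIC) == "\u2013")
```

The entity from the internal subset is expanded, the undeclared &ndash; is rejected as a parse error, and the numeric reference resolves without any declaration.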
