Character references in HTML and XML

To find out everything you need to know about numeric character references and character entity references in HTML and XML, read Lachlan Hunt’s article Character References Explained.

The conclusion is that character entity references such as &ndash; work fine in HTML documents. If the same documents are then converted to XHTML and served as application/xhtml+xml, there is no guarantee that the character entity references will continue to work.

Some browsers include a pseudo-DTD catalog that contains the character entity references, letting the browser display them as if the document were HTML. Other browsers simply omit character entity references other than the five that are predefined in XML.

The safest approach is to use numeric character references, such as &#8211; for an en dash.
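
As a rough illustration of that advice, here is a minimal Python sketch that rewrites named references as numeric ones. It assumes the HTML 4 entity table that ships in Python's standard html.entities module, and the function name to_numeric_references is made up for this example. The five references predefined in XML and any unknown names are left untouched. The regular expression is naive and does not skip script blocks or CDATA sections.

```python
import re
from html.entities import name2codepoint

# The five entity references that XML itself predefines.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named reference such as &ndash; (hypothetical, simplified pattern).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup):
    """Replace named HTML 4 entity references with decimal numeric references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave predefined and unknown names exactly as they were.
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY_RE.sub(replace, markup)


print(to_numeric_references("2004&ndash;2005 &amp; &copy; &nosuchname;"))
# prints: 2004&#8211;2005 &amp; &#169; &nosuchname;
```

Running something like this over a document before serving it as application/xhtml+xml removes the dependency on whatever entity catalog the browser does or does not ship.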

  • October 23, 2005

Comments

1. October 23, 2005 by Poncho

I recommend the Character entities cheat sheet from ILoveJackDaniels.

Cheers; Poncho

2. October 24, 2005 by mini-d

Instead of that, why not use pure UTF goodness?

3. October 24, 2005 by Mats Lindblad

Thanks Roger, but stop pointing your finger at me ... ;) We all learn something new every day and I'll read the above-mentioned article thoroughly, promise. :D

4. October 24, 2005 by SilentWarrior

I think there are still way too many how-tos, tutorials, etc. on the web that tell you to use &ndash; and the like instead of numeric references, although numeric references don't seem to have any disadvantages.

5. October 24, 2005 by Anup

I would agree with Mini-d, UTF-8 or another encoding that supports the characters you want would be better.

Some problems:

  • Homesite (if you like it as much as I do!) doesn't support Unicode! I switched to jEdit instead (www.jedit.org), which is not perfect, but highly customizable.
  • You might be stuck in a situation where either half your team uses something that saves your UTF files as ANSI (hence the question marks and blocks on the web pages!) or your CMS may not help you much...

There are probably a good number of other problems, but where you are in control or have influence, selecting the right character encoding in the first place would seem best.
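
The question marks and blocks mentioned above can be reproduced with a small Python sketch. It assumes that "ANSI" means Windows-1252 (cp1252), which is only one of several possibilities, and the helper name save_and_reload is invented for the illustration.

```python
def save_and_reload(text, saved_as, read_as):
    """Encode text as an editor might save it, then decode it as a browser might read it."""
    data = text.encode(saved_as, errors="replace")
    return data.decode(read_as, errors="replace")


# A character that the legacy code page cannot represent is lost on save.
lost_on_save = save_and_reload("\u03bb", "cp1252", "cp1252")
print(lost_on_save == "?")  # True: the Greek lambda became a question mark

# UTF-8 bytes read as Windows-1252 turn one en dash into three wrong characters.
misread_utf8 = save_and_reload("\u2013", "utf-8", "cp1252")
print(len(misread_utf8))  # 3

# Windows-1252 bytes read as UTF-8 yield U+FFFD, usually drawn as a block or diamond.
misread_ansi = save_and_reload("\u00e9", "cp1252", "utf-8")
print(misread_ansi == "\ufffd")  # True
```

A numeric character reference sidesteps all three cases because it consists only of ASCII characters, which is part of why it remains a safe fallback when the toolchain cannot be trusted with UTF-8.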

6. October 24, 2005 by Roger Johansson

Poncho: Handy chart, that.

mini-d, Anup: Absolutely, using UTF-8 is better, but not always an option. And it is harder to make everything work properly when you use UTF-8 since application support is not all the way there yet.

Mats: No fingers pointed your way ;-).

7. October 25, 2005 by Lachlan Hunt

Some browsers will treat them as if the document was HTML, while others will display parsing errors or simply omit them. The safest is to use numeric character references.

I think you misunderstood something in my article. Browsers that support application/xhtml+xml will always (or at least should) treat the document as XHTML. However, because no browser uses a validating parser, they don't read the external DTD.

As a workaround, Mozilla has a pseudo-DTD catalogue containing just the ENTITY declarations and includes it as if they were all declared within the internal subset (which is read by all XML parsers).

For example, the following contains an entity declaration within the internal subset. "foo" is the entity name and "bar" is the replacement text.

    <!DOCTYPE html[
      <!ENTITY foo "bar">
    ]>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <p>&foo;</p>
    </html>

(try that code in Firefox, it will actually work)
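
The same behaviour can be observed outside a browser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which sits on top of the non-validating expat parser. It only illustrates how such a parser treats the different kinds of reference and does not describe what any particular browser does internally. The helper names first_paragraph_text and is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def first_paragraph_text(document):
    """Parse the document as XML and return the text of its first <p> child."""
    root = ET.fromstring(document)
    return root.find(f"{XHTML_NS}p").text


def is_well_formed(document):
    """Return True if the XML parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# An entity declared in the internal subset is read and expanded.
with_internal_subset = """<!DOCTYPE html[
  <!ENTITY foo "bar">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <p>&foo;</p>
</html>"""

# A numeric character reference needs no declaration at all.
numeric = '<html xmlns="http://www.w3.org/1999/xhtml"><p>&#8211;</p></html>'

# A named reference with no declaration in sight is rejected by the parser.
undeclared = '<html xmlns="http://www.w3.org/1999/xhtml"><p>&ndash;</p></html>'

print(first_paragraph_text(with_internal_subset))  # bar
print(first_paragraph_text(numeric) == "\u2013")   # True
print(is_well_formed(undeclared))                  # False
```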

8. October 25, 2005 by Roger Johansson

Lachlan: That was basically what I meant, but I tried to simplify things a bit ;-). I updated that part and I hope it is correct now.

9. February 19, 2006 by Jeff Dickey

My problem is that I've been spoilt by various XML-based toolchains like DocBook that let you define entities in internal subsets within the source document, and use those entities throughout the ensuing XML. I haven't been able to get XHTML to work that way properly, which is a major bummer in my book; it means I'm essentially going to have to go back to a full-blown XSLT transformation just to be able to code replaceable entities. Look at the linked page (seven-sigma dot com slash foo dot html) to see a trivial example of what I'm talking about.

About the author

Roger Johansson is a Swedish web professional specialising in web standards, accessibility, and usability.