Character references in HTML and XML
To find out everything you need to know about numeric character references and character entity references in HTML and XML, read Lachlan Hunt’s article Character References Explained.
The conclusion is that if you use character entity references such as &ndash; in HTML documents, they work fine. If the same documents are then converted to XHTML and sent as application/xhtml+xml, there is no guarantee that the character entity references will continue to work.
Some browsers include a pseudo-DTD catalog that contains the character entity references, letting the browser display them as if the document were HTML. Other browsers will simply omit character entity references other than the five that are predefined in XML (&amp;, &lt;, &gt;, &quot; and &apos;).
The safest option is to use numeric character references, for example &#8211; instead of &ndash;.
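If you have existing markup full of named references, the conversion can be automated. The following is only a minimal sketch in Python, assuming the standard library's html.entities.name2codepoint table (which covers the HTML 4 entity names) is sufficient for your documents. The function name to_numeric_references is made up for this example, and the regular expression does not skip CDATA sections or script content.

```python
import re
from html.entities import name2codepoint

# The five entities predefined in XML; these can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named reference such as &ndash; (not numeric ones like &#8211;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Replace named HTML entity references with decimal numeric references.

    Hypothetical helper for illustration: predefined XML entities and
    names missing from name2codepoint are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return _ENTITY_RE.sub(replace, markup)


print(to_numeric_references("<p>1999&ndash;2005 &amp; beyond&hellip;</p>"))
# prints: <p>1999&#8211;2005 &amp; beyond&#8230;</p>
```

The output no longer depends on any DTD being read, so it should behave the same whether the document is served as text/html or as application/xhtml+xml.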
Comments
I recommend the Character entities cheat sheet from ILoveJackDaniels.
Cheers; Poncho
Instead of that, why not use pure UTF goodness?
Thanks Roger, but stop pointing your finger at me … ;) We all learn something new every day and I’ll read the above-mentioned article thoroughly, promise. :D
I think there are still way too many how-tos, tutorials etc. on the web that tell you to use &ndash; etc. instead of numeric references, although numeric references don’t seem to have any disadvantages.
I would agree with Mini-d, UTF-8 or another encoding that supports the characters you want would be better.
Some problems:
- HomeSite (if you like it as much as I do!) doesn’t support Unicode! I switched to jEdit instead (www.jedit.org), which is not perfect but highly customizable.
- You might be stuck in a situation where either half your team uses something that saves your UTF files as ANSI (hence the question marks and blocks on the web pages!) or your CMS may not help you much…
There are probably a good number of other problems, but where you are in control or have influence selecting the right character encoding in the first place would seem best.
Poncho: Handy chart, that.
mini-d, Anup: Absolutely, using UTF-8 is better, but not always an option. And it is harder to make everything work properly when you use UTF-8 since application support is not all the way there yet.
Mats: No fingers pointed your way ;-).
I think you misunderstood something in my article. Browsers that support application/xhtml+xml will always (or at least should) treat the document as XHTML. However, because no browser uses a validating parser, they don’t read the external DTD.
As a workaround, Mozilla has a pseudo-DTD catalog containing just the ENTITY declarations and includes it as if they were all declared within the internal subset (which is read by all XML parsers).
For example, the following contains an entity declaration within the internal subset. “foo” is the entity name and “bar” is the replacement text.
<!DOCTYPE html [
  <!ENTITY foo "bar">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <p>&foo;</p>
</html>
(try that code in Firefox, it will actually work)
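The same behaviour can be checked outside a browser. What follows is a minimal sketch, assuming Python's standard xml.etree.ElementTree (a non-validating parser built on expat) is a reasonable stand-in for a browser's XML parser. The helper names paragraph_text and parses are invented for this example.

```python
import xml.etree.ElementTree as ET

# Same document as above: "foo" is declared in the internal subset.
WITH_INTERNAL_SUBSET = (
    '<!DOCTYPE html [ <!ENTITY foo "bar"> ]>'
    '<html xmlns="http://www.w3.org/1999/xhtml"><p>&foo;</p></html>'
)

# Same markup, but without the declaration.
WITHOUT_DECLARATION = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><p>&foo;</p></html>'
)

XHTML_P = "{http://www.w3.org/1999/xhtml}p"


def paragraph_text(document: str) -> str:
    """Parse the document and return the text of its first <p> child."""
    root = ET.fromstring(document)
    return root.find(XHTML_P).text


def parses(document: str) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(paragraph_text(WITH_INTERNAL_SUBSET))  # prints: bar
print(parses(WITHOUT_DECLARATION))           # prints: False
```

The parser reads the internal subset and expands &foo; to its replacement text, but it reports an undefined entity as soon as the declaration is missing. Unlike the browsers described above, this parser rejects the document outright rather than silently dropping the reference.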
Lachlan: That was basically what I meant, but I tried to simplify things a bit ;-). I updated that part and I hope it is correct now.
My problem is that I’ve been spoilt by various XML-based toolchains like DocBook that let you define entities in internal subsets within the source document, and use those entities throughout the ensuing XML. I haven’t been able to get XHTML to work properly that way, which is a major bummer in my book; it means I’m essentially going to have to go back to a full-blown XSLT transformation just to be able to code replaceable entities. Look at the linked page (seven-sigma dot com slash foo dot html) to see a trivial example of what I’m talking about.