Understanding and extending semantics in HTML

In a series of three long articles (Part I - Traditional HTML Semantics, Part II - Standardizing Vocabularies, and Part III - Directions in HTML Semantics), John Allsopp expands on his thoughts about how the HTML-based Web can be improved to allow for better semantics.

After explaining what “semantics” actually means, John defines three different semantic classifications that an HTML element can belong to:

structural
Defines document structure. Examples: div, span, h1 - h6, ul, ol, dl, p.
content
Defines the type of content it marks up. Examples: abbr, address, code.
rhetorical
Defines rhetoric added by the author. Examples: em, strong.
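
As a rough illustration, the three categories above can be sketched as a small lookup table. It contains only the example elements quoted here, not John's full list, and the names SEMANTIC_CATEGORIES and classify are made up for this sketch.

```python
# Illustrative lookup table built only from the examples listed above.
# The names SEMANTIC_CATEGORIES and classify are hypothetical.
SEMANTIC_CATEGORIES = {
    "structural": {"div", "span", "h1", "h2", "h3", "h4", "h5", "h6",
                   "ul", "ol", "dl", "p"},
    "content": {"abbr", "address", "code"},
    "rhetorical": {"em", "strong"},
}


def classify(element_name):
    """Return the category of an element name, or None if it is not listed."""
    name = element_name.lower()
    for category, elements in SEMANTIC_CATEGORIES.items():
        if name in elements:
            return category
    return None


print(classify("strong"))  # rhetorical
print(classify("h3"))      # structural
```

An element that is not among the examples, such as table, simply returns None in this sketch.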

The full list, which includes attributes, is available in Classifying the semantics of HTML. I haven’t seen the elements and attributes of HTML classified like this before, but it all makes sense to me.

The conclusion John comes to is (unless I am misunderstanding something) that extending HTML by adding new elements is a bad idea. Adding a new semantic element only works until there is a need for something else, at which point the loop starts again. And so on. Instead, John calls for a way of infinitely extending HTML, much like microformats.
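
To make the idea of extending HTML through agreed class names a little more concrete, here is a minimal sketch of how a consumer might pull meaning out of such markup using Python's standard html.parser module. The class names author and date are hypothetical examples rather than a real microformat, ClassSemanticsParser is a name invented for this sketch, and the code assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as open.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class ClassSemanticsParser(HTMLParser):
    """Collect the text of elements whose class names are in a known vocabulary."""

    def __init__(self, vocabulary):
        super().__init__()
        self.vocabulary = set(vocabulary)
        self.found = []    # (class name, text) pairs in document order
        self._stack = []   # per open element: the vocabulary classes it carries

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append([c for c in classes if c in self.vocabulary])

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            # Attribute the text to the innermost open element only.
            for name in self._stack[-1]:
                self.found.append((name, text))


parser = ClassSemanticsParser({"author", "date"})
parser.feed('<p>Posted by <span class="author">madr</span> on '
            '<span class="date">August 31, 2007</span></p>')
print(parser.found)
```

The point of the sketch is that the vocabulary lives outside the HTML specification: a tool only needs to agree with the author on what the class names mean, which is both the strength and the weakness of this approach.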

He does have a point, though I’m not sure I agree with it. I’m not saying I disagree either, just that I don’t know. What do you think? Should new elements and attributes be added to the HTML specification when there is a need for them? Should there be another way of extending and improving the semantics of HTML without requiring the specification to be updated? Perhaps combining the two approaches would be better?

That’s a lot of questions. Anyone have answers?

Posted on August 31, 2007 in (X)HTML, Quicklinks

Comments

  1. August 31, 2007 by madr

    IMHO there is a lack of content markups in current doctypes. It is often useful to write <span class="author">madr</span> or <span class="date">today</span> to both add semantics and extend presentation, for example. DATE or AUTHOR elements would be really nice instead.

  2. I would love to see a semantic way of denoting navigation. It seems odd that something that is on pretty much every website does not have a way of denoting it. For me the nl tag in XHTML 2 is a sensible addition.

  3. Well, I was a bit confused, since I did not see the <table> element in the list of content elements. Why? We should use tables, where we show tabular data, no?

  4. Well, I think enhancing the semantic resolution of the web is great, but fails in reality. If it takes years for an element like abbr to be added to a certain mainstream browser, what will happen to other new additions? I fear the worst.

    However, if web developers would entice visitors to use only open-source software, it might work very well.

    But nonetheless, I think that it would be much better if web developers were able to add elements and define them in a custom DTD that fills in the gaps of the standard W3C DTD. So, if you know some browsers lack support for abbr, you define it yourself, like defining custom functions in a programming language on top of the existing ones.

  5. I love the idea of an open source participation for web creators where we can also act on usability and design issues for the W3C… delegated responsibility!!

    No organisation can possibly set a standard for all the needs of billions of internet users, and that is why the W3C has so many problems and will never finish its work.

    So W3C, invite and delegate responsibility to interested, highly skilled developers and creative people with usability skills for the future…

    That’s what we need….

    Michael

  6. (note: I’m only responding on this article and its comments, I have yet to read through John’s articles, for which I do not have time at the moment)

    While adding elements sounds like a neat idea, I’m not at all sure it improves semantics. In response to Marc Siepman (4): while adding elements to a custom DTD will make a document containing your own made-up elements validate, it will not necessarily improve semantics. You, being the web developer, will be able to make sense of custom elements, but when user agents won’t recognize them, that extended vocabulary will never reach the user, will it? Also, services reading your document, like search engines, won’t understand the elements and all your effort is probably lost.

    I might be wrong, but the way I see it is this: if the English-speaking world invents the word “Glopkin”, it will have semantically extended its vocabulary and English-speaking people will have new and better ways to express themselves. For someone living in a third-world country however, it will still not make any sense. The only way adding elements will properly improve semantics is when all services and user agents add it to their vocabulary at the same time.

  7. I for one think semantics should be considered even when difficult. If a tag isn’t supported or doesn’t exist yet, I still think the ids and class names used should clearly show what the element represents.

  8. Yes Harmen, that is something I have been thinking about. But to what extent do user agents understand what they parse?

    If I want a new element, say init for initialism or psbl for pseudo-blend, I would be able to explain to a user agent that it’s a sort of abbreviation. I think that would be an improvement. It would be even better if you could define it in even more detail.

    Anyway, it’s not just the markup that is important, but also the power of the parsing engine.

  9. I guess this is where the role attribute kicks in.

  10. I found the classification very useful and realised that there is semantic markup I hadn’t paid any attention to. Who can really tell offhand the difference between “code” and “samp”?

    Semantics are useful to the extent that they are understood by the audience (directly by a user agent, rendered in a browser, or by a reader). So we already have problems with existing tags (check http://www.w3.org/TR/html401/struct/text.html#edef-ABBR for examples of use of abbr and acronym). You will find that certain abbreviations like WWW are to be read as three consecutive Ws while Mass. is read like a word; the same goes for the acronyms WAC (W, then A, then C) and radar (a word).

    My point is that we need to have a close look at the standards and should not be afraid of toileting, removing unnecessary tags and adding new ones (I’d like to use the author element I mentioned before).

  11. Sorry no answers Roger, but perhaps a question that may lead to some…

    HTML (with a greater degree of certainty in older versions) follows a document metaphor: the original set of elements was selected to provide a mark-up framework for research papers, hence elements like headers, paragraphs etc. The one notable exception is the form elements, which, although a step away from the metaphor, are a sensible addition (and can have a paper-based equivalent).

    If we are adding new elements what metaphor are they following, and how are we changing the nature of our mark-up?

  12. I don’t think that we really need too many more semantical elements within the HTML standard. What we as developers need is more control over the existing semantic elements. I would be content having new elements roll out when a legit need arises. However a larger priority should be taking an already semantic element like an unordered list, and being able to make it behave like a table. Mozilla already has display: table-cell. Only when we have a firm control over the existing elements, should we start looking back to add new ones. Otherwise the same cycle will repeat. Besides, giving more control over elements will only provide an additional incentive for developers and even amateurs to write more semantic code.

  13. I’m not sure whether it’s an answer or not, but I like the idea of spending more effort on creating semantic microformat standards. Semantically I’m not convinced by a number of things that people are talking about with HTML 5, or that they couldn’t be achieved through standard names of classes and ids. Or perhaps even the addition of some sort of type attribute.

    My very simplistic approach is that you mark up the structural groupings of content on a page (and HTML seems to currently do this) and then you can use various classifications to control how they are presented. Given that I don’t really see why HTML needs to be expanded. And with standards compliance being the issue it is with browser vendors, why would we look for a more complicated solution?

  14. August 31, 2007 by Roger Johansson (Author comment)

    George: There are many sensible things in XHTML 2. I hope at least some of them will make it into HTML 5.

    Adam: Yes, we should use the table element for tabular data. I don’t know why it’s missing. I suppose it would fit in the “structural” category.

    Gerben:

    I guess this is where the role attribute kicks in.

    Yeah, but HTML 5 doesn’t have a role attribute (not yet anyway).

    Philippe:

    My point is that we need to have a close look at the standards and should not be afraid of toileting, removing unnecessary tags and adding new ones

    HTML 5 will do that, though I would like it to remove even more cruft and add more new elements than the current draft indicates.

    Richard:

    If we are adding new elements what metaphor are they following, and how are we changing the nature of our mark-up?

    I think some should follow the document metaphor, while others should be more targeted at Web applications, where the concept of a document is not as clear.

    Eddie:

    I don’t think that we really need too many more semantical elements within the HTML standard. What we as developers need is more control over the existing semantic elements.

    That is up to browser vendors, not the W3C. The HTML and CSS specifications are already there and allow pretty much complete control. The problem is that browser vendors haven’t fully implemented HTML and CSS (especially CSS).

  15. August 31, 2007 by Amit

    There should be something like a “master tag” that describes each chunk of information (div, p, ul etc.), called something like <meaning="something">. This “something” should adhere to a list of predefined attributes, and search engines should look for those attributes in an “advanced search” mode (which will cease to be advanced and become the normal search mode), kind of like adding meta tags in the body of your document/HTML.

    This tag should not affect the visual look of things - it should only be used for describing the content in one word (much like writing attributes in XML, isn’t HTML a form of XML?!). This feature should be scalable and fluid, allowing the web to evolve and include new definitions.

  16. While I can understand his point, the web is an ever evolving place, and HTML should evolve with it. We’re constantly adding and refining new types of content, so why shouldn’t we refine the way we support that content?

  17. I am looking forward to HTML5 and am excited about the new elements. But even while I think there should be some new elements, I also think there should be a way of infinitely extending HTML, and there already is. It’s the combination of elements like div and span with class and id attributes. These already let you, effectively, invent your own new elements.

    That’s not to say people shouldn’t investigate new, more elegant approaches. If something better comes along, great. But for now, I feel comfortable with the way we can already “extend” HTML.

  18. For a long time I’ve thought that the current state of HTML and the proposed additions to HTML 5 are woefully inadequate for creating a rich semantic language that we can mechanically pull much meaning from. Microformats are a nice step in the right direction, but I see so much more potential in XML as the underlying language of the web (as I’ve argued here before).

    You want to add more semantics to your ecommerce store, such as marking up the price, options, product number? Simply pick an existing schema or write, document and publish your own. There’s no waiting around for the W3C, and then browser vendors, to decide what elements should be used, what they should be called, and what they should mean. We simply don’t need a large consensus body making these decisions. We can simply document the rules of semantics that our pages live by, and then all that information is there in a solid, documented structure for anyone else to make sense of later.

    Really, everything is 99% there for switching to XML already. There are a few browser quirks in IE, and it’d be nice if the search engines would reassure everyone that XML will be treated as a first class citizen in the same way as HTML.

    The only items I want HTML to define and get a consensus on are FUNCTIONALITY items. The proposed video, progress, and similar elements that let me DO more in the browser are awesome. I couldn’t care less about aside, dialog, time, etc. I would much prefer the widely supported ability to author documents in a way that fits the meaning of my content, and then point to a DTD, XSD or similar definition that lets the interested consumer pull the meaning out of that document.

  19. I haven’t read the article you referenced yet, but I want to point out that hReview creator, a microformat generator I use often, uses a blockquote for the entire item description. I don’t find this semantically accurate.

    You said, “Instead, John calls for a way of infinitely extending HTML, much like microformats.”

    My point is, if we extend the semantic reach of HTML with anything, we need to be sure the semantics are, in fact, accurate.

  20. @Scott Wehrenberg

    Spot on Scott!

    The “X” in XHTML is there for a reason.

    Now if we only had real content/capability negotiation that defined which versions of various functionality extensions a browser supports (and not limited to mainly preferred MIME-types and Browser-identification strings)…

    Yeah I know :( - But I can dream can’t I :)

  21. Nah, I seriously think we have enough already, so why more?? Would new markup really help us and add something new for web design???

    I think XHTML needs work, but I doubt new markup would solve the issues…

  22. It’s an interesting idea. I think microformats are certainly useful, but for me HTML elements are far more concise and meaningful.

    I also think things like roles in XHTML 2 can be added to this mix. What I would ideally like to see is this:

    Popular conventions become markup elements over time. There will still be elements missing, so for those either microformats, or roles (or a combination) would be useful. If there is a critical mass on those new things, they could become elements. And repeat!

    So my view is that microformats are great, but perhaps a stepping stone to facilitate better markup.

    Hard to tell if it is practical. For example, how browsers, assistive technology such as screen readers, developers etc keep up may be an issue. Will the microformat community want to do stuff which may be temporary? How many versions of HTML will we end up with! [This may be why XML could potentially be so useful. Combining the ideas behind XHTML 2 and HTML 5 might be a useful start?]

  23. September 1, 2007 by Ben Boyle

    I think both approaches are extremely useful.

    Additional elements/attributes in HTML are great for semantics that are really widespread and common: things everyone can use. The nav and section elements seem good examples of this in HTML5. I approach this by thinking: should this be in the HTML spec, and should all UAs implement it? These are the things that bother us because we can’t reliably use them, so we end up searching for solutions to non-existent semantics. You do what you can, but it feels like hacking, and you wonder if it’s useful to anyone at the end of the day anyway …

    If I think a “semantic feature” is aimed at a specific set of users (which is still likely to be a large audience on the Internet) then there are other options …

    At one end of the scale you might go fully XML with an agreed schema between parties. Microformats sit in a nice middleground where everything pretty much still works for everyone (it’s still HTML) but there’s a bit more available. If you’re interested in that, you may need to pimp your browser to get access.

    To me it seems to be the “progressive enhancement” approach to semantics. I think it works rather well: microformats and Operator are a good example. I’m sure there is plenty more action yet to come in this space.

    Just my opinion :)

  24. I personally believe (uh hum…) that there are more than enough HTML elements to go around. Where I think value can be added is in the attribute department. Being able to remove an anchor element when marking up a navigation list and adding the href attribute to an li element would be spectacular, e.g. <li href="/hi.html">Your nav text</li>

    I’m sure there are some other very exciting attribute additions that other, smarter people could come up with.

  25. The three links at the top to the series of articles appear not to be working.

  26. As more types of businesses move to the web the need to express a custom business domain will increase. After all, HTML was originally made for markup of technical reports (hence the kbd and samp elements).

    In XHTML 2 this has been solved by incorporating RDFa. RDFa allows authors to use custom vocabularies without requiring a central organization to manage the markup.

    Inclusion of RDFa is also mentioned as a goal for the HTML 5 working group:

    The HTML WG is encouraged to provide a mechanism to permit independently developed vocabularies such as Internationalization Tag Set (ITS), Ruby, and RDFa to be mixed into HTML documents.

    The beauty of RDFa is that it takes us a lot closer to the semantic web.
