Why HTML
June 7th, 2008 by DeWitt Clinton

Short post here …

The thread started by Elliotte Rusty Harold (super smart guy) called Why XHTML is provoking a number of intelligent and articulate responses. Here’s my take:

I used to be firmly in the XHTML camp, but now I’m not. I’m not against people outputting valid XHTML instead of HTML; they probably should, unless it is too hard. I just don’t think that it matters all that much in practice. My rationalization for this is pretty straightforward, and goes like this:

The web is made up of a nearly infinite variety of html-ish documents, some valid xhtml, some valid html, but mostly just almost-valid almost-html. So if you are writing a client for the web as it is, you will need to accept the most liberal input possible because a client that throws errors when it sees a missing '/>' symbol isn’t a very useful client at all.
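To make that concrete, here is a minimal sketch using only Python's standard library as a stand-in for a real client. The helper names (`TagCollector`, `strict_ok`, `lenient_tags`) are made up for illustration, and `xml.etree.ElementTree` only checks well-formedness rather than validating against a DTD, so treat it as an approximation of the strict path.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient parser: records each start tag and tolerates sloppy markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_tags(markup):
    """Return the start tags a liberal HTML parser manages to find."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# A document with a missing '/>' on the <br> element.
sloppy = "<p>Hello<br>world</p>"
print(strict_ok(sloppy))     # the strict XML parser rejects it
print(lenient_tags(sloppy))  # the lenient parser still finds the tags
```

The strict parser gives up on the whole document over one unclosed element, while the lenient one keeps going and still recovers the structure a client would need to render something.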

So take the best possible case: A client sees a document, and the server declares that document’s mime type to be application/xhtml+xml or the document declares one of the xhtml doctypes, so the client fires up a validating XML parser, finds the document to be valid XML, outputs an XML tree (nothing html specific here), then walks that tree according to the most state-of-the-art html heuristics (e.g., the html5 grammar), and renders the document accordingly (the hard and time-consuming part).

That’s the best case.

The worst case is: A client sees a document, and the server doesn’t bother to set a mime-type at all (or sets the wrong one) and/or doesn’t declare an xhtml doctype, and the client starts walking the characters in the document according to the most state-of-the-art html heuristics (e.g., the html5 grammar), and renders the document accordingly (the hard and time-consuming part).

The difference between the best and the worst case is slight enough, and I’m not particularly convinced that the cost of running the first pass through a validating XML parser and creating an XML parse tree buys you much benefit when starting to parse the interesting stuff, the html itself.

But the real clincher is that the best case scenario is vanishingly unlikely. There are three possible outcomes here: 1) the document isn’t declared to be xhtml at all so you fall through to the worst case, 2) the document is declared to be valid xml so you start parsing it only to find out it was invalid, so you start over again with the worst case, or 3) the document is valid xml and you have the best case scenario.

In practice on the real web, cases 1 and 2 are so much more likely that if you’re a pragmatic client you may as well start with the worst case and never bother parsing the document for the best case in the first place.
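The three outcomes can be sketched as a small decision function. This is only an illustration under stated assumptions: `choose_path` and its return strings are hypothetical, the check looks at the declared mime type only (doctype sniffing is left out), and `xml.etree.ElementTree` tests well-formedness rather than validity.

```python
import xml.etree.ElementTree as ET

XHTML_TYPE = "application/xhtml+xml"


def choose_path(content_type, body):
    """Classify a document into one of the three outcomes described above."""
    # Outcome 1: nothing says this is xhtml, so go straight to lenient parsing.
    if content_type != XHTML_TYPE:
        return "case 1: not declared xhtml, lenient parse"
    # Outcome 2: declared as xhtml, but the strict parse fails partway through,
    # so the work is thrown away and the client starts over leniently.
    try:
        ET.fromstring(body)
    except ET.ParseError:
        return "case 2: declared xhtml but malformed, restart with lenient parse"
    # Outcome 3: declared as xhtml and actually well-formed.
    return "case 3: well-formed xhtml, strict parse succeeded"


print(choose_path(None, "<p>hi<br></p>"))
print(choose_path(XHTML_TYPE, "<p>hi<br></p>"))
print(choose_path(XHTML_TYPE, "<p>hi<br/></p>"))
```

Two of the three branches end in the lenient parser anyway, and the second one pays for a failed strict parse first, which is why a pragmatic client might skip the strict attempt entirely.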

Now if you’re a document author, and you know that pragmatic clients are going to interpret your document according to the worst case rules no matter what you do, your only real incentive for writing xhtml over html is one of personal preference.

I can see some cases with machine-generated documents where emitting xhtml is actually just as easy as emitting html (say, if you are building up a DOM tree to represent your content anyway), or where you need to round-trip within a closed ecosystem of clients. But as soon as humans are involved in the production of the document, given the axiom that no human should ever have to write xml by hand, you are almost certainly going to be producing html, not valid xhtml.

Ultimately it comes down to the human element. People write the content and people will always write it using non-validating formats, if for no other reason than that writing compliant documents is hard, and people have better things to do with their time than check for missing '/>' symbols. (And moreover, people also write the software that generates xhtml, and that software is often buggy and produces invalid documents anyway.)

My conclusion is that if you can produce valid xhtml, go for it. But in the end it doesn’t make much of a difference. In fact, I’m pretty certain that the web itself wouldn’t have succeeded if xhtml had been required from the beginning, because a web that renders and displays documents is a much better web than one that throws validation errors all over the place.

(And all this, coming from a former xhtml guy…)

Footnote, a few of the other axioms that inform my thinking on this:

  1. HTML is intended to be written by humans (by hand).
  2. HTML is intended to be displayed to humans.
  3. Machines that want to interpret HTML need to act like humans, and not the other way around.

14 Responses to “Why HTML”

  1. Aristotle Pagaltzis Says:
    given the axiom that no human should ever have to write xml by hand, you are almost certainly going to be producing html, not valid xhtml.

    I can’t follow. Either you are going to let them write HTML directly, in which case it’s still angle-bracket markup and the “axiom” is based on an irrelevant distinction; or you’re instead going to use something like a WYSIWYG editor or a shorthand syntax like Markdown, in which case the markup is programmatically generated, which means you should be able to write an emitter that produces XML.

    If you let users write markup directly, what matters is how soon the error is detected. If the parse error only shows up in the browser of a visitor, long after the author has clicked the Save button, well, FAIL. My favoured approach is to attach an HTML scrubber to the authoring environment, and when a markup error is detected, try to fix it automatically, then present the result to the user and prompt them for whether this corresponds to what they actually meant. If yes, they can rubberstamp it, otherwise they get a chance to fix the problem manually.

  2. DeWitt Clinton Says:

    There’s a big difference between “angle-bracket markup” and syntactically valid XML angle-bracket markup, though. And good point about markdown-esque formatting options (which could indeed produce xhtml) and/or scrubbers… And I like the notion of introducing a feedback loop into the mix.

  3. Craig Overend Says:

    Pretty much agree with all this myself.

    Persistence – or reliable operation – is about management: People make mistakes, interpret concepts differently and as a result things break – even tests themselves. Formalising is currently just too expensive. Valid XML should not be the test; buggy, unique clients are. Pass the target browsers or client test and most people don’t care how valid you are.

    I personally think subsets of HTML should stand on their own, just as processes in fault-tolerant systems like Erlang do. Any single component (or standard) that can take down an entire system is the weakest link.

    That said; critical parts of any system required for correct operation should require strictness and also be able to prove their own integrity on the client-side. This is where I see policy-driven strict subsets of signed documents important. Most of these strict operations are however only needed for transaction security when the risk is high; so people use HTTPS.

    I’d still like to see everything – even invalid documents – signed because the way documents stand, any point between a and b that part of that document traverses – can inject malicious content into it.

    Valid signatures should be the test, not valid documents; unless those documents are to be embedded in 3rd party content – which is analogous to a client (where the policing should happen). Then policy-driven sanitizers filtering HTML/CSS and JavaScript(Cajita and ADSafe) should be a security requirement.

  4. Robert Græsdal Says:

    Outputting xhtml instead of html is not just personal preference

    You also get 1. Future safety for easy processing of your data (most languages today support reading xml, few do html). 2. You know the user agent (UA is not just browser, also google, yahoo etc.) reads your document correctly.

    If you need any of these points, output xhtml.

  5. Dave McIntyre Says:

    I didn’t bother reading past the second paragraph because XHTML is not hard. Don’t confuse laziness with difficulty.

  6. James Bennett Says:

    Dave, I’d disagree: XHTML is indeed hard, because doing XHTML exposes you to all sorts of neat little quirks and corner cases that can, unlike HTML, flat-out break your site if you don’t know about and work around them.

    For example, you have to learn two different sets of CSS rendering rules, because in HTML ‘body’ is the root element that you want to apply styles to, but in XHTML it’s ‘html’. You also get to learn the quirks of two different DOM interfaces, and work out when to use namespaced methods (XHTML DOM) and when to use non-namespaced methods (HTML DOM). And then there’s the way you have to be careful about character entities because browsers use non-validating XML parsers.

    And… well, that’s just the tip of the iceberg. XHTML is hard to do right even for experts, and don’t believe anyone who tries to tell you otherwise.

  7. bharath Says:

    I must certainly agree with your point. But, when most people use Dreamweaver or some similar software, then the software itself would take care of this issue. BTW, your statement about parsers is obvious.

  8. Le Roux Says:

    Postel’s Law: “be conservative in what you do, be liberal in what you accept from others” (often reworded as “be conservative in what you send, be liberal in what you receive”).

    So basically – try and create valid or at least nearly valid xhtml and never write a client that crashes because of “normal” html. I can think of many analogies of things where unrecoverable minor errors causing catastrophic failure would be just plain stupid.

    Most of the internet is bad html and it is very easy for invalid things to creep into otherwise valid xhtml at runtime.

    Or rather campaign for (x)html to be replaced with something easy to parse that cannot possibly be badly formed (see some of the lightweight plain text markup languages) and only supports utf8. Then browsers can start to add support for that and people can gradually switch over?

  9. John thomas Says:

    I would be inclined to agree with the original poster.

    JT http://www.FireMe.To/udi

  10. Mr Code Says:

    bEST DESCRIBOR EVER: “infinite variety of html-ish documents”

  11. Z Says:

    I’m often amazed (but shouldn’t be) when I see people who write code and can’t remember, or perhaps have never seen, what the project scope requirements are.

    In this case, the scope is to send and receive documents across a network where the user(s) are computer illiterate; AND where many of those sending data will create that data with imperfect tools and imperfect understanding of correct message formatting.

    Clearly, TBL may not have envisioned what FrontPage does to HTML, nor what many other programs do. What he did envision was programs that you, I, and aunt Jenny can just sit down and use to create data that we can all share with each other. That was the point in the first place. Doing so in a technically correct manner (while always good) is not the main point.

    I have always wished to see acid tests and test suites simply give a grade on the page for anyone to use. If your website gets an A, very cool. If it gets a B+ because it does not fully render in IE4, that’s cool too. I wish that is how many people thought of this side of the Internet. There are some very nice, very professional websites out there that really deserve a D- and never get it. The consequence of this is that those who created it and subsequently those who want to emulate it, never learn to be a bit more compatible with the rest of the world.

  12. Greg Whitescarver Says:

    Robert Græsdal makes an important point regarding future-proofing. Even if your XHTML documents are human-ish (i.e. not perfectly valid), you are producing a document that is surprisingly well-supported by the browsers in use today, and that has a great likelihood of rendering with fewer quirks in future browsers.

    We folks that write XHTML also write code that really does need to be understood by computers. Our employers and our peers will benefit if we make an effort to be machine-readable in our web documents, because, like Robert said, the web browser UI isn’t the only place our documents are consumed.

    Perhaps you’re just covering up your embarrassment that there is a ‘center’ tag on the Google search page :-P

  13. Articles and links worth checking out - 10 June 2008 | 24 Hours is not enough Says:

    [...] Why HTML [...]

  14. Bernhard Schulte Says:

    Asking the validator at W3C takes approx. 10 seconds for an uploaded file. Most browsers’ dev toolbars offer a shortcut for that operation. If that procedure is too slow for you – grab the validator and install it on your local network.

    There are no valid excuses for sloppy mark-up!