DTDs Don’t Work on the Web

Last weekend, Slashdot linked to an article that observed that Netscape had removed the RSS 0.91 DTD. Netscape quickly restored the resource on a temporary basis. I hope this episode has a silver lining and helps people realize that DTDs don’t belong on the Web.

My initial thought was “Wow, people are actually using XML parsers with RSS” and next “but don’t have the good sense to disable the processing of external entities”. It has been interesting to observe how different people react to the story. I think Netscape’s position that they want to get rid of the burden is actually quite reasonable, although the initial removal was obviously due to the New Netscape being ignorant about the activities of the Old Netscape.

A Single Point of Failure is Bad

There’s a lot of HTML and XHTML sample code out there that contains a doctype that points to a DTD on www.w3.org. The illusion that this somehow “works” is based on browsers not actually retrieving the referenced DTDs. This is why www.w3.org doesn’t melt down under a massive ongoing distributed denial of service attack. This is also why the Web keeps working when there is a power outage at MIT and www.w3.org is temporarily unreachable. (Update: www.w3.org is under a massive DDoS attack all the time—but not from browsers. Interestingly, the questions raised by the W3C Systeam don’t include “Should we admit that DTDs are a bad idea and we should get rid of them?” Update: Microsoft issues updates for various versions of MSXML to avoid parsing failure due to www.w3.org banning the IP address of the host running MSXML.)

The RSS 0.91 DTD incident shows how bad an idea the remote DTD becomes when it is for real and not just an illusion maintained to keep up appearances. First, there’s a single point of failure. When the DTD became unavailable, apps around the world stopped working. Not good. Second, the RSS 0.91 DTD is retrieved over 4 million times per day. That’s nuts. Burdening a single third party like that for something as useless as a DTD makes no sense.

Advice

Although I now have a bit of a “told you so” feeling, when I was younger, I too thought that loading external entities is a problem that needs to be solved. I have gotten over it, though. The right way to fix the use of DTDs on the Web is not to use them on the producer side and not to resolve them on the consumer side.
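On the consumer side, "not resolving" usually comes down to one parser setting. Here is a minimal sketch using Python's standard xml.sax module, which is my choice for illustration rather than anything the incident involved. The feed content, the example.invalid DTD URL and the names TitleCollector and parse_without_dtd are made up for the example. As far as I can tell, CPython's expat-based reader skips the external DTD subset as well when this feature is off, and recent Python versions already default to off, but setting it explicitly costs nothing.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical feed; the DTD URL is a placeholder that must never be fetched.
FEED = b"""<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "http://example.invalid/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>Example feed</title>
    <item><title>First item</title></item>
  </channel>
</rss>
"""


class TitleCollector(ContentHandler):
    """Collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def parse_without_dtd(data):
    """Parse a feed without retrieving any external entity."""
    parser = xml.sax.make_parser()
    # Do not resolve external general entities.
    parser.setFeature(feature_external_ges, False)
    handler = TitleCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.titles


print(parse_without_dtd(FEED))
```

The doctype is still tolerated in the input. It just never causes a network request, so the consumer keeps working whether or not the DTD host is up.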

FAQ

Unfortunately, writing about this requires pre-emptively refuting all the usual misconceptions, so here goes:

But if I don’t have a doctype, my document cannot be validated, right?

The document won’t be “valid” in the sense of the term defined in XML 1.0, but this does not matter. Validity in the sense defined in XML 1.0 is dubious and overrated.

DTDs aren’t a particularly powerful validation formalism. Moreover, a document that declares its own grammar gives a consumer no basis for trusting that the document adheres to any particular rules. RELAX NG validation takes two distinct inputs (the document and the schema), and the document cannot override the schema. In the RSS 0.91 case, the Feed Validator (which uses custom Python code instead of any schema formalism) does a much better job at checking the feed syntax than DTD-based validation would.
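The "two distinct inputs" idea can be shown without any schema language. The sketch below is not the Feed Validator and not RELAX NG. It is a toy checker in the same spirit, where the rules live with the consumer and the document is only data. The rule set REQUIRED_CHANNEL_CHILDREN and the function check_feed are invented for illustration and are not the actual RSS 0.91 requirements.

```python
import xml.etree.ElementTree as ET

# Illustrative rules owned by the consumer; NOT the real RSS 0.91 rules.
REQUIRED_CHANNEL_CHILDREN = ("title", "link", "description")


def check_feed(xml_text, required=REQUIRED_CHANNEL_CHILDREN):
    """Return a list of problems found when checking the document against the rules."""
    root = ET.fromstring(xml_text)
    if root.tag != "rss":
        return ["root element is not <rss>"]
    channel = root.find("channel")
    if channel is None:
        return ["missing <channel>"]
    problems = []
    for child in required:
        if channel.find(child) is None:
            problems.append(f"<channel> lacks <{child}>")
    return problems


GOOD = ("<rss><channel><title>t</title><link>l</link>"
        "<description>d</description></channel></rss>")
# This document declares its own permissive grammar; the checker ignores it.
SELF_DECLARED = ("<!DOCTYPE rss [<!ELEMENT channel ANY>]>"
                 "<rss><channel><title>t</title></channel></rss>")

print(check_feed(GOOD))
print(check_feed(SELF_DECLARED))
```

Whatever the second document says about itself in its internal subset, the checker's verdict depends only on the rules the consumer passed in.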

But a spec says I must use a doctype. What can I do?

There’s not-so-great spec writing out there. You can see what happens if you don’t comply. ☺

Isn’t a doctype needed to trigger the standards layout in browsers?

For text/html content, yes, but this isn’t about HTML. For documents served using an XML content type, no.

How can a consuming application determine the type of a document without a doctype?

Using the doctype as an indicator of the type of the document (what kind of document it is) is utterly bogus. Please don’t let the unfortunate name “doctype” or the HTML 4.01 spec fool you.

If a consuming application can handle multiple XML vocabularies, the right way to dispatch documents to different handlers is to check the namespace of the root element. For vocabularies that aren’t in a namespace, the right way to dispatch is to look at the MIME type. Vocabularies that aren’t in a namespace and don’t have a vocabulary-specific MIME type are badly behaved and you have to dispatch on the root element name.
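The dispatch order described above (namespace first, then MIME type, then root element name as a last resort) could look roughly like this in Python. The handler labels, the three lookup tables and the function names are all hypothetical, and the MIME type entry is only an illustrative value. A real application would map to handler objects instead of strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical dispatch tables; the entries are illustrative only.
NAMESPACE_HANDLERS = {
    "http://www.w3.org/2005/Atom": "atom",
    "http://www.w3.org/1999/xhtml": "xhtml",
}
MIME_HANDLERS = {"application/rss+xml": "rss"}
ROOT_NAME_HANDLERS = {"rss": "rss"}


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


def choose_handler(xml_bytes, mime_type=None):
    """Pick a handler label for a document, or None if nothing matches."""
    root = ET.fromstring(xml_bytes)
    namespace, local = split_tag(root.tag)
    # 1. A namespaced root element decides on its own.
    if namespace is not None:
        return NAMESPACE_HANDLERS.get(namespace)
    # 2. No namespace: fall back to a vocabulary-specific MIME type.
    if mime_type in MIME_HANDLERS:
        return MIME_HANDLERS[mime_type]
    # 3. Last resort for badly behaved vocabularies: the root element name.
    return ROOT_NAME_HANDLERS.get(local)


print(choose_handler(b'<feed xmlns="http://www.w3.org/2005/Atom"/>'))
print(choose_handler(b'<rss version="0.91"/>', "application/xml"))
```

Note that the doctype plays no part in any of the three steps.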

Are you aware of WML 2.0?

Yeah, unfortunately. Fortunately, WML 2.0 is not really that relevant to the Web. Opera and WebKit-based browsers are doing a good job of obsoleting the concept of the “Mobile Web” and making WML a legacy footnote.

But DTDs are part of the XML spec. You said subsetting is improper. Shouldn’t apps support the whole spec?

Ad hoc subsetting is wrong. Having different spec profiles is also bad in general, but the XML spec itself effectively defines three processing profiles:

  1. Not performing DTD-based validation, not retrieving external entities.

  2. Not performing DTD-based validation, retrieving external entities to perform infoset augmentation and entity reference expansion.

  3. Performing DTD-based validation, retrieving external entities to perform infoset augmentation and entity reference expansion.

The whole point of having profiles is to allow apps to choose a suitable one. The creators of the XML spec made profile #1 for browsers. It would be crazy for Web-facing apps not to take the opportunity to avoid the DTD cruft when given the chance. In fact, the whole concept of well-formedness is there to support DTDlessness!

Aren’t you disallowing character entities?

It is a done deal. Many people are just still in denial. Character entities (other than the 5 predefined ones) became unsafe for the Web when the XML 1.0 spec made them an optional feature (profile #1 above). On the Web, you can’t count on optional features (which is why it is a bad idea for Web-oriented specs to have optional features).

The situation is unfortunate, but this really is an input method problem between you and your editor. Fixing it in the wire format is the wrong place. However, if I had a chance to go back in time and change XML 1.0, I’d define all the XHTML and MathML entities as predefined. But it is too late, because much of the value of XML is in interoperable off-the-shelf parsers and changing XML would break the interop (which is why XML 1.1 is such a bad idea). So you just need to figure out a better input method and use straight UTF-8.
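To see the consumer-side effect concretely, here is a small sketch with Python's xml.etree.ElementTree, picked only because it is a handy off-the-shelf parser that does not read external DTDs. With no declarations available, only the predefined entities resolve, while literal characters and numeric character references always work. The helper name text_or_none is made up for this example.

```python
import xml.etree.ElementTree as ET


def text_or_none(markup):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


print(text_or_none("<p>&amp;</p>"))      # predefined entity: accepted
print(text_or_none("<p>&nbsp;</p>"))     # undeclared entity: rejected
print(text_or_none("<p>&#xA0;</p>"))     # numeric character reference: accepted
print(text_or_none("<p>caf\u00e9</p>"))  # literal character: accepted
```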

Don’t you know that Gecko maps MathML entities to PUA characters so using straight UTF-8 with the real astral characters is different?

I am aware of that. What Gecko does is dirty, should be fixed and definitely should not be encouraged.

How do I establish IDness without DTD processing?

You establish it by mutating the infoset reported by the XML processor based on some criteria related to element and attribute names and namespaces. An “xml:id processor” does this. A conceptually analogous “XHTML id processor” could be defined. (XHTML5 implies such a processing component.) Both can be implemented as SAX filters.
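As a rough illustration of the SAX filter approach, the sketch below uses Python's xml.sax.saxutils.XMLFilterBase to re-report xml:id attributes with type ID while leaving every other attribute as CDATA. The class and function names (IdTypedAttributes, XmlIdFilter, IdIndex, index_ids) are invented here. A real xml:id processor has more to do than this, such as normalizing values and checking them, which this sketch leaves out.

```python
import io
import xml.sax
from xml.sax.handler import (ContentHandler, feature_external_ges,
                             feature_namespaces)
from xml.sax.saxutils import XMLFilterBase
from xml.sax.xmlreader import AttributesNSImpl

XML_NS = "http://www.w3.org/XML/1998/namespace"
ID_ATTRIBUTES = {(XML_NS, "id")}


class IdTypedAttributes(AttributesNSImpl):
    """Attribute set that reports type ID for the chosen attribute names."""

    def getType(self, name):
        return "ID" if name in ID_ATTRIBUTES else "CDATA"


class XmlIdFilter(XMLFilterBase):
    """SAX filter that passes events through, retyping xml:id as ID."""

    def startElementNS(self, name, qname, attrs):
        names = attrs.getNames()
        typed = IdTypedAttributes(
            {n: attrs.getValue(n) for n in names},
            {n: attrs.getQNameByName(n) for n in names},
        )
        super().startElementNS(name, qname, typed)


class IdIndex(ContentHandler):
    """Downstream consumer: maps each ID value to its element's local name."""

    def __init__(self):
        super().__init__()
        self.ids = {}

    def startElementNS(self, name, qname, attrs):
        for attr_name in attrs.getNames():
            if attrs.getType(attr_name) == "ID":
                self.ids[attrs.getValue(attr_name)] = name[1]


def index_ids(xml_bytes):
    reader = XmlIdFilter(xml.sax.make_parser())
    reader.setFeature(feature_namespaces, True)
    reader.setFeature(feature_external_ges, False)  # no DTD retrieval
    index = IdIndex()
    reader.setContentHandler(index)
    reader.parse(io.BytesIO(xml_bytes))
    return index.ids


print(index_ids(b'<doc><p xml:id="intro">hi</p><p id="plain"/></doc>'))
```

The downstream handler only asks for the attribute type, so it does not care whether IDness came from a DTD or from the filter. An “XHTML id processor” would differ mainly in which attribute names it treats as IDs.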

What about NOTATIONs?

Ha ha. We’re out of frequent questions.