Last weekend, Slashdot linked to an article that observed that Netscape had removed the RSS 0.91 DTD. Netscape quickly restored the resource on a temporary basis. I hope this episode has a silver lining and helps in making people realize that DTDs don’t belong on the Web.
My initial thought was “Wow, people are actually using XML parsers with RSS” and the next was “but they don’t have the good sense to disable the processing of external entities”. It has been interesting to observe how different people react to the story. I think Netscape’s position that they want to get rid of the burden is actually quite reasonable, although the initial removal was obviously due to the New Netscape being ignorant about the activities of the Old Netscape.
There’s a lot of HTML and XHTML sample code out there that contains a doctype that points to a DTD on www.w3.org. The illusion that this somehow “works” is based on browsers not actually retrieving the referenced DTDs. This is why www.w3.org doesn’t melt down under a massive ongoing distributed denial of service attack. This is also why the Web keeps working when there is a power outage at MIT and www.w3.org is temporarily unreachable.
The RSS 0.91 DTD incident shows how bad an idea the remote DTD becomes when it is for real and not just an illusion maintained to keep up appearances. First, there’s a single point of failure. When the DTD became unavailable, apps around the world stopped working. Not good. Second, the RSS 0.91 DTD is retrieved over 4 million times per day. That’s nuts. Burdening a single third party like that for something as useless as a DTD makes no sense.
Although I now have a bit of a “told you so” feeling, when I was younger, I too thought that the loading of external entities was a problem that needed to be solved. I have gotten over it, though. The right way to fix the use of DTDs on the Web is not to use them on the producer side and not to resolve them on the consumer side.
If you are a spec writer for an XML vocabulary, please don’t specify a DTD for the vocabulary. Please suggest that implementations run their XML parsers with external entity resolution disabled. Please note, however, that banning the doctype like SOAP does (or otherwise subsetting XML) is improper in terms of spec layering.
If you are an implementor of a Web-facing XML-consuming app, please configure your XML parser not to perform DTD-based validation and not to resolve external entities. If, for legacy reasons, you must process some well-known DTDs, please make your entity resolver retrieve those DTDs from a local catalog. (For the internal subset, be sure to have protection against the Billion Laughs attack in place.)
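As a sketch of that parser configuration, here is how the relevant features can be switched off with Python’s stdlib SAX parser. The feed snippet and the `TitleGrabber` handler are illustrative examples, not part of the original incident:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges, feature_external_pes

# Illustrative RSS 0.91 snippet with a doctype pointing at a remote DTD.
FEED = b"""<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91"><channel><title>Example</title></channel></rss>"""

class TitleGrabber(ContentHandler):
    """Collects the character data of <title> elements."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        self._in_title = (name == "title")

    def endElement(self, name):
        self._in_title = False

    def characters(self, content):
        if self._in_title:
            self.titles.append(content)

parser = xml.sax.make_parser()
# Never fetch external general or parameter entities: the remote DTD
# reference stays unresolved and no network traffic happens.
parser.setFeature(feature_external_ges, False)
parser.setFeature(feature_external_pes, False)

handler = TitleGrabber()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(FEED))
```

The document parses fine without the DTD ever being retrieved; well-formedness checking does not need it.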
If you publish XML files on the Web, please don’t include a doctype. (This point is specifically not about text/html.)
Unfortunately, writing about this requires pre-emptively refuting all the usual misconceptions, so here goes:
The document won’t be “valid” in the sense of the term defined in XML 1.0, but this does not matter. Validity in the sense defined in XML 1.0 is dubious and overrated.
DTDs aren’t a particularly powerful validation formalism. Moreover, having the document declare its own grammar is worthless as far as a consumer’s ability to trust that the document adheres to particular rules is concerned. RELAX NG validation takes two distinct inputs—the document and the schema—and the document cannot override the schema. In the RSS 0.91 case, the Feed Validator (which uses custom Python code instead of any schema formalism) does a much better job at checking the feed syntax than DTD-based validation would.
If the spec for your format requires a DTD, that’s an instance of the not-so-great spec writing out there. You can try what happens if you don’t comply. ☺
As for whether the doctype is needed to trigger the standards rendering mode in browsers: for text/html content, yes, but this isn’t about HTML. For documents served using an XML content type, no.
Using the doctype as an indicator of the type of the document (what kind of document it is) is utterly bogus. Please don’t let the unfortunate name “doctype” or the HTML 4.01 spec fool you.
If a consuming application can handle multiple XML vocabularies, the right way to dispatch document to different handlers is to check the namespace of the root element. For vocabularies that aren’t in a namespace, the right way to dispatch is to look at the MIME type. Vocabularies that aren’t in a namespace and don’t have a vocabulary-specific MIME type are badly behaved and you have to dispatch on the root element name.
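A minimal sketch of that dispatch order with Python’s ElementTree; the handler registry and the entries in it are made-up examples (MIME-type dispatch isn’t shown, since it happens outside the parser):

```python
import xml.etree.ElementTree as ET

# Hypothetical handler registry: namespaced vocabularies are keyed by the
# root element's namespace URI; badly behaved no-namespace vocabularies
# fall back to the root element's name.
HANDLERS = {
    "http://www.w3.org/2005/Atom": lambda root: "atom",
    "http://www.w3.org/1999/xhtml": lambda root: "xhtml",
    "rss": lambda root: "rss",  # no namespace, no specific MIME type
}

def dispatch(xml_bytes):
    root = ET.fromstring(xml_bytes)
    # ElementTree reports namespaced names as "{uri}localname".
    if root.tag.startswith("{"):
        ns, _, _local = root.tag[1:].partition("}")
        handler = HANDLERS.get(ns)
    else:
        # No namespace: in real code, check the MIME type first;
        # here we go straight to the root element name.
        handler = HANDLERS.get(root.tag)
    return handler(root) if handler else None
```

For example, `dispatch(b'<feed xmlns="http://www.w3.org/2005/Atom"/>')` routes on the Atom namespace, while a namespace-less `<rss>` root falls through to the element-name lookup.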
Yeah, unfortunately, WML 2.0 does mandate a doctype. Fortunately, WML 2.0 is not really that relevant to the Web. Opera and WebKit-based browsers are doing a good job of obsoleting the concept of the “Mobile Web” and making WML a legacy footnote.
Ad hoc subsetting is wrong. Having different spec profiles is also bad in general, but the XML spec itself effectively defines three processing profiles:
1. Not performing DTD-based validation, not retrieving external entities.
2. Not performing DTD-based validation, retrieving external entities to perform infoset augmentation and entity reference expansion.
3. Performing DTD-based validation, retrieving external entities to perform infoset augmentation and entity reference expansion.
The whole point of having profiles is to allow apps to choose a suitable one. The creators of the XML spec made profile #1 for browsers. It would be crazy for Web-facing apps not to take the opportunity to avoid the DTD cruft when given the chance. In fact, the whole concept of well-formedness is there to support DTDlessness!
The death of character entities (other than the 5 predefined ones) on the Web is a done deal. Many people are just still in denial. Character entities became unsafe for the Web when the XML 1.0 spec made them an optional feature (profile #1 above). On the Web, you can’t count on optional features (which is why it is a bad idea for Web-oriented specs to have optional features).
The situation is unfortunate, but this really is an input method problem between you and your editor. Fixing it in the wire format is the wrong place. However, if I had a chance go back in time and change XML 1.0, I’d define all the XHTML and MathML entities as predefined. But it is too late, because much of the value of XML is in interoperable off-the-shelf parsers and changing XML would break the interop (which is why XML 1.1 is such a bad idea). So you just need to figure out a better input method and use straight UTF-8.
I am aware that Gecko resolves certain well-known doctypes from local copies in order to support their entities. What Gecko does is dirty, should be fixed and definitely should not be encouraged.
Without a DTD, you establish the ID-ness of attributes by mutating the infoset reported by the XML processor based on some criteria related to element and attribute names and namespaces. An “xml:id processor” does this. A conceptually analogous “XHTML id processor” could be defined. (XHTML5 implies such a processing component.) Both can be implemented as SAX filters.
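For instance, an xml:id processor of the kind described can be sketched as a SAX filter with Python’s stdlib. The `XmlIdFilter` class and its ID map are illustrative, not a conforming xml:id implementation:

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces
from xml.sax.saxutils import XMLFilterBase

XML_NS = "http://www.w3.org/XML/1998/namespace"

class XmlIdFilter(XMLFilterBase):
    """Passes SAX events through unchanged while recording xml:id values,
    assigning ID-ness by attribute name and namespace, no DTD involved."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self.ids = {}  # ID value -> (namespace, local name) of bearing element

    def startElementNS(self, name, qname, attrs):
        value = attrs.get((XML_NS, "id"))
        if value is not None:
            self.ids[value] = name
        # Forward the event to the downstream content handler unchanged.
        super().startElementNS(name, qname, attrs)

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)  # report (uri, localname) pairs
filt = XmlIdFilter(parser)
filt.parse(io.BytesIO(b'<doc><p xml:id="intro">Hello</p></doc>'))
```

After parsing, `filt.ids` maps each xml:id value to the element that carried it; a downstream handler plugged into the filter sees the untouched event stream.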
Ha ha. We’re out of frequent questions.