Bogo-XML Declaration Returns to Gecko

Firefox 89 was released today. This release (again!) honors a character encoding declaration made via syntax that looks like an XML declaration used in text/html (if there are no other character encoding declarations).
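To make "syntax that looks like an XML declaration" concrete, here is a minimal sketch of how such a declaration could be sniffed from the first bytes of a text/html resource. It is an illustration under simplifying assumptions, not the algorithm from the HTML spec or Gecko's implementation: the function name `sniff_xml_declaration_encoding` is hypothetical, the matching is deliberately loose, and a real implementation would consult the result only when there is no BOM, HTTP-layer declaration, or meta declaration.

```python
import re
from typing import Optional

# Simplified pattern (an assumption of this sketch): the word "encoding",
# optional whitespace, "=", optional whitespace, then a quoted label whose
# closing quote matches the opening one.
_ENCODING_RE = re.compile(rb"""encoding\s*=\s*(["'])([^"'>]*)\1""")


def sniff_xml_declaration_encoding(head: bytes) -> Optional[str]:
    """Return the encoding label from an XML-declaration-like prefix, or None.

    `head` is the start of the byte stream of a text/html resource.
    """
    # The resource must begin with the literal bytes "<?xml"; a declaration
    # that appears later in the stream is ignored.
    if not head.startswith(b"<?xml"):
        return None

    # Only look inside the declaration itself, i.e. up to the first ">".
    end = head.find(b">")
    if end == -1:
        return None

    match = _ENCODING_RE.search(head[:end])
    if match is None:
        return None

    label = match.group(2).strip()
    if not label:
        return None
    # Encoding labels are ASCII and compared case-insensitively.
    return label.decode("ascii", "replace").lower()
```

For a page that starts with `<?xml version="1.0" encoding="UTF-8"?>` and carries no other declaration, this sketch yields the label `utf-8`, which is the case that matters for the Web compatibility issue described below.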

Before HTML parsing was specified, Internet Explorer did not support declaring the encoding of a text/html document using the XML declaration syntax. However, Gecko, WebKit, and Presto did. Unfortunately, I didn’t realize that they did.

When Hixie specified HTML parsing, consistent with IE, he didn’t make the spec sensitive to the XML declaration syntax in any particular way. I am unable to locate any discussion in the WHATWG mailing list archives about whether an encoding declaration made using the XML declaration syntax should be honored when processing text/html.

When I implemented the specified HTML parsing algorithm in Gecko, I also implemented the internal encoding declaration handling per specification. As a side effect, in Firefox 4, I removed Gecko’s support for the XML declaration syntax for declaring the character encoding in text/html. I don’t recall this having been a knowingly-made decision: The rewrite just did strictly what the spec said.

When WebKit and Presto implemented the specified HTML parsing algorithm, they only implemented the tokenization and tree building parts and kept their old ways for handling character encoding declarations. That is, they continued to honor the XML declaration syntax for declaring the character encoding in text/html. I don’t recall the developers of either engine raising this as a spec issue back then.

The closest to the issue getting raised as a spec issue was for the wrong reason, which made people push back instead of fixing the spec.

When Blink forked, it inherited WebKit’s behavior. When Microsoft switched from EdgeHTML to Blink, Gecko became the only actively-developed major engine not to support the XML declaration syntax for declaring the character encoding in text/html. Since unlabeled UTF-8 is not automatically detected, this became a Web compatibility issue with pages that declare UTF-8 but only using the XML declaration syntax (i.e. without a BOM, a meta, or an HTTP-layer declaration as well).

And that’s why support for declaring the character encoding via the XML declaration syntax came to the HTML spec and back to Gecko.

What Can We Learn?

Additional Observations (2021-06-02)