Tag Soup: How Mac IE 5 and Safari handle <x> <y> </x> </y>

In November 2002 Hixie wrote a blog entry titled Tag Soup: How UAs handle <x> <y> </x> </y> about the DOM outcomes of parsing a sample tag soup document in Mozilla, Windows IE 6 and Opera 7 beta. Those aren’t the only notable HTML+CSS+DOM implementations, though. The two notable codebases Hixie didn’t cover are Tasman (the engine of Mac IE 5) and KHTML (the engine of Konqueror and Safari).

Safari 1.0

With Safari, there are no surprises. When Hyatt added “residual style” support (closing and reopening unclosed inline elements when a block element closes and another one opens) to the HTML parser of Apple’s fork of KHTML, he mostly followed what Mozilla does. As a result, when only the BODY subtree is observed, Safari does what Mozilla does with Hixie’s example. That is, both Mozilla and Safari deal with misnested markup at the parser level: the DOM is always a tree, and layout uses the same content tree that is exposed via the DOM. This approach is probably the easiest to cope with. (Note: In Hixie’s Mozilla diagram, b and the lower EM are erroneously marked as siblings.)
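To make the close-and-reopen idea concrete, here is a minimal Python sketch. It is not Mozilla’s or Safari’s actual algorithm: it ignores the inline/block distinction that real residual style handling depends on, and the names build_tree and serialize as well as the token format are made up for illustration. It only shows how a parser can turn <x> <y> </x> </y> into a proper tree by closing the still-open inner element and reopening it after the outer one has closed.

```python
def build_tree(tokens):
    """Build a tree from a flat token list such as ['<x>', 'text', '</x>'].

    Simplification (an assumption of this sketch): every element is treated
    alike, whereas real parsers only reopen certain inline elements.
    """
    root = {"name": "#root", "children": []}
    stack = [root]  # currently open elements, innermost last

    def open_element(name):
        node = {"name": name, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)

    for token in tokens:
        if token.startswith("</"):
            name = token[2:-1]
            if name not in [n["name"] for n in stack[1:]]:
                continue  # stray end tag with no open element: ignore it
            reopened = []
            # Close everything that is still open inside the target element.
            while stack[-1]["name"] != name:
                reopened.append(stack.pop()["name"])
            stack.pop()  # close the target element itself
            # Reopen the elements that were closed early, outermost first.
            for reopened_name in reversed(reopened):
                open_element(reopened_name)
        elif token.startswith("<"):
            open_element(token[1:-1])
        else:
            stack[-1]["children"].append(token)  # character data
    return root


def serialize(node):
    """Write the tree back out as well-nested markup."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node["children"])
    if node["name"] == "#root":
        return inner
    return "<{0}>{1}</{0}>".format(node["name"], inner)


tokens = ["<x>", "1", "<y>", "2", "</x>", "3", "</y>", "4"]
result = serialize(build_tree(tokens))
print(result)  # <x>1<y>2</y></x><y>3</y>4
```

Note that the single y start tag ends up as two y element nodes in the tree. That is the price of keeping the DOM a tree.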

There are some differences between Safari and Mozilla, but they appear elsewhere in the DOM tree. Mozilla represents whitespace between tags as text nodes even when the content model forbids character data. For example, in Mozilla, there’s a text node between the HEAD and BODY nodes. In Safari, there isn’t. Mozilla puts comments in the DOM (and innerHTML) but Safari doesn’t.
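Code that walks the DOM and has to work in both engines can sidestep these differences by skipping whitespace-only text nodes and comments. The following is a rough Python sketch of that idea. It uses xml.dom.minidom from the standard library as a stand-in for a browser DOM (an assumption of this example: minidom is an XML parser, not a tag soup parser, but like Mozilla it keeps inter-element whitespace and comments). The helper name significant_children is made up.

```python
from xml.dom import minidom, Node


def significant_children(node):
    """Return the child nodes, minus comments and whitespace-only text."""
    result = []
    for child in node.childNodes:
        if child.nodeType == Node.COMMENT_NODE:
            continue  # not every engine exposes comments
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            continue  # whitespace between tags: not every engine keeps it
        result.append(child)
    return result


doc = minidom.parseString(
    "<html><head><title>t</title></head>\n"
    "<!-- a comment -->\n"
    "<body><p>hi</p></body></html>"
)
html = doc.documentElement

raw_count = len(html.childNodes)  # head, text, comment, text, body
names = [child.nodeName for child in significant_children(html)]
print(raw_count, names)
```

With the filter in place, HEAD and BODY come out as adjacent children regardless of whether the engine kept the whitespace and the comment between them.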

Mac IE 5.2.2

The results with Tasman are more interesting. Like Windows IE 6, Mac IE 5 builds a non-tree DOM out of misnested markup. However, Mac IE 5 does it differently:

[A diagram showing that the DOM outcome in Tasman is not a tree]

Things to note

There’s exactly one element node per start tag.
A node can be a child of more than one node. (Each node has only one parentNode, though.)
The nextSibling and previousSibling references aren’t necessarily reciprocal.
There’s no node d at all. However, the text of d appears in the innerHTML of ADDRESS.

Have fun running some tree traversal algorithms on that data structure.
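For anyone who does want to run a traversal on such a structure, the invariants listed above can at least be checked defensively. Below is a minimal Python sketch with a made-up FakeNode class and a made-up find_violations function. The sample graph is a hypothetical shape that merely exhibits the same kinds of anomalies. It is not a transcription of the Tasman diagram.

```python
class FakeNode:
    """A bare-bones stand-in for a DOM node (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.next_sibling = None
        self.previous_sibling = None


def find_violations(root):
    """Walk the child lists from root and report broken tree invariants."""
    violations = []
    seen = set()
    pending = [root]
    while pending:
        node = pending.pop()
        if id(node) in seen:
            # In a real tree, each node is reachable by exactly one path.
            violations.append(node.name + " is reachable more than once")
            continue  # also guards against looping forever
        seen.add(id(node))
        nxt = node.next_sibling
        if nxt is not None and nxt.previous_sibling is not node:
            violations.append(node.name + ".next_sibling is not reciprocal")
        for child in node.children:
            if child.parent is not node:
                violations.append(
                    child.name + " is a child of " + node.name
                    + " but has a different parent"
                )
            pending.append(child)
    return violations


# A hypothetical non-tree: span is listed under both em and p.
body, em, p, span = (FakeNode(n) for n in ("body", "em", "p", "span"))
body.children = [em, p]
em.parent = p.parent = body
em.next_sibling = p      # p.previous_sibling is left unset on purpose
em.children = [span]
p.children = [span]
span.parent = p          # only one parent reference, as in the list above

violations = find_violations(body)
for line in violations:
    print(line)
```

The seen set is the important part: without it, a traversal that assumes a tree visits shared nodes more than once, and could fail to terminate if the references ever formed a cycle.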

Of course, the possible non-treeness of the DOM in Tasman raises questions about the internal data structures of the engine. Why did they choose to accommodate tag soup by allowing non-treeness of the parsed document when the DOM and CSS specs require a tree? The non-treeness seems likely to cause problems in both Trident and Tasman.

It turns out there is some layout weirdness going on with the test document as well: the style properties are applied to different pieces of content depending on the width of the viewport.

[Blue background after the element with the blue bg has been closed; expected text color, though]