Introducing SAX Tree

When I was designing the Validator.nu HTML Parser, it became apparent that full HTML error handling together with the SAX API required building a tree and then emitting SAX event when traversing the tree. Also, I could foresee cases where the user interface implementation for Validator.nu could benefit from an XML tree that can be efficiently turned into SAX events with non-locking concurrent reads.

There are multiple existing XML tree APIs for Java. The most notable ones are DOM, XOM, JDOM and dom4j. However, they represent text as Strings while SAX uses char[], use a lot of instanceof when outputting to SAX, do a lot of attribute copying or don’t preserve information about source line and column. Therefore, I chose to write yet another XML tree package.

The expected use cases lead to these design principles:

Preserve information exposed through ContentHandler, LexicalHandler and Locator.
Creation from SAX events or as part of the parse of a conforming HTML5 document should be fast.
Emitting SAX events based on the tree should be fast.
Mutations should be possible but should not make the above “fast” cases slower.
Concurrent reads should work without locking when there are no concurrent mutations.
The user of the API has the responsibility of using the API properly: for the sake of performance, the model does not check if it is being used properly. Improper use may, therefore, put the model in an inconsistent state.

These design principles are almost the exact opposite of the design principles behind XOM: XOM emphasizes integrity checking over performance while SAX Tree emphasizes performance (of certain kind) and doesn’t do integrity checking.

Contrary to the usual Java XML API design, SAX Tree does not use a list of child nodes for elements. Instead, it uses links to next sibling and the first child. Append is O(1) thanks to a reference to the last child as well. The typical implementation pattern for the tree to SAX event converter is that the SAX dependency is in the traversal code and that code has to examine each node for its type. SAX Tree is tightly coupled with SAX by design, and the visitor pattern is used instead. That is, each node knows how to emit a SAX event on visit and revisit. This way, normal dynamic method dispatch is leveraged, and an instanceof chain is not needed. Further, the visitor traverses the tree iteratively—not recursively.

For the use cases that SAX Tree was designed for, it outperforms Xerces DOM, XOM, JDOM and dom4j. Also, SAX Tree faithfully replays everything that SAX exposes through ContentHandler, LexicalHandler and Locator. Especially preserving Locator data is something that the four tree APIs don’t do.

I wrote the bulk of SAX Tree in June 2007, and the code has been stable since July 2007. However, I haven’t publicized it as a public API, because I wanted to retain the option to redesign some things. Now the API has been stable for over a year, so I feel the API can be considered public and stable.

SAX Tree ships in the jar for the Validator.nu HTML Parser, but you cannot currently obtain a SAX Tree Document as the result of parsing HTML5. API docs are available.