When I was designing the Validator.nu HTML Parser, it became apparent that full HTML error handling together with the SAX API required building a tree and then emitting SAX event when traversing the tree. Also, I could foresee cases where the user interface implementation for Validator.nu could benefit from an XML tree that can be efficiently turned into SAX events with non-locking concurrent reads.
There are multiple existing XML tree APIs for Java. The most notable ones are DOM, XOM, JDOM and dom4j. However, they represent text as String
s while SAX uses char[]
, use a lot of instanceof
when outputting to SAX, do a lot of attribute copying or don’t preserve information about source line and column. Therefore, I chose to write yet another XML tree package.
The expected use cases lead to these design principles:
ContentHandler
,
LexicalHandler
and Locator
.
These design principles are almost the exact opposite of the design principles behind XOM: XOM emphasizes integrity checking over performance while SAX Tree emphasizes performance (of certain kind) and doesn’t do integrity checking.
Contrary to the usual Java XML API design, SAX Tree does not use a list of child nodes for elements. Instead, it uses links to next sibling and the first child. Append is O(1) thanks to a reference to the last child as well. The typical implementation pattern for the tree to SAX event converter is that the SAX dependency is in the traversal code and that code has to examine each node for its type. SAX Tree is tightly coupled with SAX by design, and the visitor pattern is used instead. That is, each node knows how to emit a SAX event on visit and revisit. This way, normal dynamic method dispatch is leveraged, and an instanceof
chain is not needed. Further, the visitor traverses the tree iteratively—not recursively.
For the use cases that SAX Tree was designed for, it outperforms Xerces DOM, XOM, JDOM and dom4j. Also, SAX Tree faithfully replays everything that SAX exposes through ContentHandler
,
LexicalHandler
and Locator
. Especially preserving Locator
data is something that the four tree APIs don’t do.
I wrote the bulk of SAX Tree in June 2007, and the code has been stable since July 2007. However, I haven’t publicized it as a public API, because I wanted to retain the option to redesign some things. Now the API has been stable for over a year, so I feel the API can be considered public and stable.
SAX Tree ships in the jar for the Validator.nu HTML Parser, but you cannot currently obtain a SAX Tree Document
as the result of parsing HTML5. API docs are available.