I’ve been thinking about the performance gap between the Validator.nu HTML Parser and Xerces. What can be attributed to the “extra fix-ups” that an HTML parser has to do and what can be attributed to my code being worse than the Xerces code?
Tokenizing XML and HTML is pretty similar. Sure, an HTML tokenizer has to check each name character for upper case, but then an XML tokenizer has to check the silliness that is the Name production. The main difference is in the tree construction layer. In general, comparing an XML parser and an HTML parser from different authors doesn’t tell you much about the performance cost of the “extra fix-ups” HTML needs. Parsers may have otherwise fundamentally better or worse implementation strategies, and different code bases have enjoyed different amounts of attention and tweaking. To compare the tree construction layers, the tokenizing layer needs to be kept constant.
To run a proper benchmark, I implemented a very thin token handler that trivially maps HTML5 tokens to SAX events. This token handler is only 115 lines of code (mostly autogenerated by Eclipse) compared to 3927 lines of the real HTML5 SAX streamer code. With this thin layer, the resulting parser is similar to an XML5 parser without support for Namespaces.
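To give an idea of what such a thin layer looks like, here is a sketch. Note that `MiniTokenHandler` is a hypothetical, much-simplified stand-in for illustration only; the real Validator.nu `TokenHandler` interface has more methods (doctype, comment, EOF, etc.) and different signatures.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical, simplified token-handler interface. The real Validator.nu
// TokenHandler interface has more methods and different signatures.
interface MiniTokenHandler {
    void startTag(String name, AttributesImpl attrs) throws SAXException;
    void endTag(String name) throws SAXException;
    void characters(char[] buf, int start, int length) throws SAXException;
}

// The thin layer: each token is forwarded as a SAX event as-is, with no
// tree-builder fix-ups in between and no Namespace processing.
class ThinSaxTokenHandler implements MiniTokenHandler {
    private final ContentHandler out;

    ThinSaxTokenHandler(ContentHandler out) {
        this.out = out;
    }

    public void startTag(String name, AttributesImpl attrs) throws SAXException {
        out.startElement("", name, name, attrs); // empty Namespace URI
    }

    public void endTag(String name) throws SAXException {
        out.endElement("", name, name);
    }

    public void characters(char[] buf, int start, int length) throws SAXException {
        out.characters(buf, start, length);
    }
}
```

The point is that nothing here looks at the tag names or maintains a stack of open elements; tokens go straight out as events.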
I chose to use the front page of Wikipedia (saved on 2008-04-01) as the test data, because it is a well-known, real-world Web page that happens to be well-formed text/html-compatible XHTML, so it can be used for testing both HTML and XML parsers.
Since I wanted to test the parser core, I eliminated the effect of IO and character decoding by letting the parsers read pre-converted UTF-16 data from a CharArrayReader. Instead of building a tree, the parsers ran in SAX mode. The content handler was an XML serializer writing to a mock Writer that wrote to nowhere. (This way, there was some code to touch each attribute in case Xerces builds attributes lazily. I have not checked if it does.) All the parsers were set to intern element and attribute name strings. The XML parsers were set not to read the DTD.
I ran each parser in a loop, first for 10 minutes to warm up HotSpot and then for another 10 minutes to actually record the benchmark. I ran the tests on Mac OS X 10.5.4 on an Intel Core 2 Duo (x86_64).
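A minimal sketch of this kind of harness, using the JDK’s built-in SAX parser for illustration (the actual benchmark wired each parser’s output into an XML serializer and also set the interning and DTD-loading features; those details are omitted here):

```java
import java.io.CharArrayReader;
import java.io.Writer;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParserBench {

    // A Writer that discards its output: in the real benchmark, the XML
    // serializer wrote into a Writer like this, so serialization work
    // happened but no IO did.
    static final class NullWriter extends Writer {
        public void write(char[] buf, int off, int len) { /* discard */ }
        public void flush() { }
        public void close() { }
    }

    // Warm up for warmupNanos, then count iterations for measureNanos.
    // The input is pre-decoded UTF-16 chars, so IO and charset decoding
    // stay out of the measurement.
    static int measure(SAXParser parser, char[] doc,
            long warmupNanos, long measureNanos) throws Exception {
        DefaultHandler handler = new DefaultHandler(); // stand-in for the serializer
        long end = System.nanoTime() + warmupNanos;
        while (System.nanoTime() < end) {
            parser.parse(new InputSource(new CharArrayReader(doc)), handler);
        }
        int iterations = 0;
        end = System.nanoTime() + measureNanos;
        while (System.nanoTime() < end) {
            parser.parse(new InputSource(new CharArrayReader(doc)), handler);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // flip to compare with/without Namespaces
        char[] doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>hi</p></body></html>"
                .toCharArray();
        // 50 ms per phase just to demonstrate; the real runs used 10 minutes each.
        System.out.println(measure(factory.newSAXParser(), doc,
                50_000_000L, 50_000_000L) + " iterations");
    }
}
```

Swapping the `SAXParser` for each parser under test while keeping everything else fixed is what makes the iteration counts comparable.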
I also included “fast read()” variants of the Validator.nu HTML Parser. These variants have the per-character error reporting and validator-precision source location tracking commented out. (I only commented them out in the most obvious place—there’s more potential for removing stuff that’s only interesting in a validator.)
Here are the results, as iterations per unit of time, relative to the tokenizer of the Validator.nu HTML Parser with the thin SAX layer. (Note that the two VMs are not equally fast: the 1.6 x86_64 server VM is over 50% faster than the 1.5 x86 client VM.)
Parser | 1.5 x86 Client | 1.6 x86_64 Server |
---|---|---|
Xerces-J 2.9.1 with Namespaces | 109% | 112% |
Xerces-J 2.9.1 without Namespaces | 141% | 142% |
Ælfred2 (Validator.nu fork) with Namespaces | 72% | 75% |
Validator.nu HTML Parser streaming SAX mode | 89% | 93% |
Validator.nu HTML tokenizer with thin SAX layer | 100% | 100% |
Validator.nu HTML Parser streaming SAX mode, fast read() | 95% | 93% |
Validator.nu HTML tokenizer with thin SAX layer, fast read() | 107% | 104% |
The difference between Xerces with and without Namespaces sure looks interesting. Let’s see what the numbers look like relative to Xerces without Namespaces (that is, with each figure divided by the Xerces-without-Namespaces figure).
Parser | 1.5 x86 Client | 1.6 x86_64 Server |
---|---|---|
Xerces-J 2.9.1 with Namespaces | 78% | 79% |
Xerces-J 2.9.1 without Namespaces | 100% | 100% |
Ælfred2 (Validator.nu fork) with Namespaces | 51% | 53% |
Validator.nu HTML Parser streaming SAX mode | 63% | 66% |
Validator.nu HTML tokenizer with thin SAX layer | 71% | 71% |
Validator.nu HTML Parser streaming SAX mode, fast read() | 68% | 66% |
Validator.nu HTML tokenizer with thin SAX layer, fast read() | 76% | 74% |
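As a quick sanity check on the rescaling, the thin tokenizer’s 100% in the first table, divided by the 141% of Xerces without Namespaces (1.5 client VM column), gives the 71% in the second table. (Other rows can be off by a percentage point from this back-of-the-envelope version, since the published figures come from the unrounded raw counts.)

```java
public class Rescale {
    // Re-express a percentage relative to a new 100% baseline.
    public static long rescale(double value, double newBaseline) {
        return Math.round(value / newBaseline * 100.0);
    }

    public static void main(String[] args) {
        // First table, 1.5 x86 client VM column: thin tokenizer 100%,
        // Xerces-J without Namespaces 141%.
        System.out.println(Rescale.rescale(100.0, 141.0) + "%"); // prints "71%"
    }
}
```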
Even when input is well-formed, the latent ability of the parser to deal with the HTML tree building rules has a cost. XHTML-as-text/html advocates take note: Making your markup well-formed doesn’t make an HTML parser go as fast as an XML parser with the same tokenizer implementation would go.
Even with Namespace support, Xerces beats the Validator.nu HTML Parser’s tokenizer (which is copying data even when not needed). In my defense, Xerces has more person years invested in it and doesn’t have validator-precision source location tracking.
Removing the per-character part of validator-precision reporting makes the Validator.nu tokenizer almost match Xerces with Namespaces on the 1.5 client VM, but the relative gain from removing reporting is smaller on the 1.6 server VM.
Namespace support in Xerces is relatively more costly than the HTML5 tree building rules in the Validator.nu HTML Parser. That’s right: on a page with a default namespace for elements declared at the top and attributes (except on the root element) in no namespace, enabling Namespaces lops off over 20% of the performance.
An XML parser (here the patched Ælfred2) can be a lot slower than an HTML parser. XML advocates take note: Just having an XML parser doesn’t guarantee performance over HTML—especially if the HTML side is getting more attention.
The performance cost of the HTML tree builder is rather small after all: 7% on the better VM.
Xerces is faster. Namespaces are worse than the much-maligned HTML “extra fix-ups” (21% hit vs. 7% hit). An XML parser can be slow.