I’ve been thinking about the performance gap between the Validator.nu HTML Parser and Xerces. What can be attributed to the “extra fix-ups” that an HTML parser has to do and what can be attributed to my code being worse than the Xerces code?
Tokenizing XML and HTML is pretty similar. Sure, an HTML tokenizer has to check each name character for upper case, but then an XML tokenizer has to check the silliness that is the Name production. The main difference is in the tree construction layer. In general, comparing an XML parser and an HTML parser from different authors doesn’t tell you much about the performance cost of the “extra fix-ups” HTML needs. Parsers may have otherwise fundamentally better or worse implementation strategies, and different code bases have enjoyed different amounts of attention and tweaking. To compare the tree construction layers, the tokenizing layer needs to be kept constant.
To run a proper benchmark, I implemented a very thin token handler that trivially maps HTML5 tokens to SAX events. This token handler is only 115 lines of code (mostly autogenerated by Eclipse) compared to 3927 lines of the real HTML5 SAX streamer code. With this thin layer, the resulting parser is similar to an XML5 parser without support for Namespaces.
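To give an idea of what such a thin layer looks like, here is a sketch. Note that `MiniTokenHandler` is a hypothetical, much-simplified stand-in for illustration only; the real Validator.nu `TokenHandler` interface has more methods (doctype, comment, EOF, etc.) and different signatures.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical, simplified token-handler interface. The real Validator.nu
// TokenHandler interface has more methods and different signatures.
interface MiniTokenHandler {
    void startTag(String name, AttributesImpl attrs) throws SAXException;
    void endTag(String name) throws SAXException;
    void characters(char[] buf, int start, int length) throws SAXException;
}

// The thin layer: each token is forwarded as a SAX event as-is, with no
// tree-builder fix-ups in between and no Namespace processing.
class ThinSaxTokenHandler implements MiniTokenHandler {
    private final ContentHandler out;

    ThinSaxTokenHandler(ContentHandler out) {
        this.out = out;
    }

    public void startTag(String name, AttributesImpl attrs) throws SAXException {
        out.startElement("", name, name, attrs); // empty Namespace URI
    }

    public void endTag(String name) throws SAXException {
        out.endElement("", name, name);
    }

    public void characters(char[] buf, int start, int length) throws SAXException {
        out.characters(buf, start, length);
    }
}
```

The point is that nothing here looks at the tag names or maintains a stack of open elements; tokens go straight out as events.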
I chose to use the front page of Wikipedia (saved on 2008-04-01) as the test data, because it is a well-known, real-world Web page that happens to be well-formed text/html-compatible XHTML, so it can be used for testing both HTML and XML parsers.
Since I wanted to test the parser core, I eliminated the effect of IO and character decoding by letting the parsers read pre-converted UTF-16 data from a CharArrayReader. Instead of building a tree, the parsers ran in SAX mode. The content handler was an XML serializer writing to a mock Writer that wrote to nowhere. (This way, there was some code to touch each attribute in case Xerces builds attributes lazily. I have not checked if it does.) All the parsers were set to intern element and attribute name strings. The XML parsers were set not to read the DTD.
I ran each parser in a loop, first for 10 minutes to warm up HotSpot and then for another 10 minutes to actually record the benchmark. I ran the tests on Mac OS X 10.5.4 on an Intel Core 2 Duo (x86_64).
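A minimal sketch of this kind of harness, using the JDK’s built-in SAX parser for illustration (the actual benchmark wired each parser’s output into an XML serializer and also set the interning and DTD-loading features; those details are omitted here):

```java
import java.io.CharArrayReader;
import java.io.Writer;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParserBench {

    // A Writer that discards its output: in the real benchmark, the XML
    // serializer wrote into a Writer like this, so serialization work
    // happened but no IO did.
    static final class NullWriter extends Writer {
        public void write(char[] buf, int off, int len) { /* discard */ }
        public void flush() { }
        public void close() { }
    }

    // Warm up for warmupNanos, then count iterations for measureNanos.
    // The input is pre-decoded UTF-16 chars, so IO and charset decoding
    // stay out of the measurement.
    static int measure(SAXParser parser, char[] doc,
            long warmupNanos, long measureNanos) throws Exception {
        DefaultHandler handler = new DefaultHandler(); // stand-in for the serializer
        long end = System.nanoTime() + warmupNanos;
        while (System.nanoTime() < end) {
            parser.parse(new InputSource(new CharArrayReader(doc)), handler);
        }
        int iterations = 0;
        end = System.nanoTime() + measureNanos;
        while (System.nanoTime() < end) {
            parser.parse(new InputSource(new CharArrayReader(doc)), handler);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // flip to compare with/without Namespaces
        char[] doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>hi</p></body></html>"
                .toCharArray();
        // 50 ms per phase just to demonstrate; the real runs used 10 minutes each.
        System.out.println(measure(factory.newSAXParser(), doc,
                50_000_000L, 50_000_000L) + " iterations");
    }
}
```

Swapping the `SAXParser` for each parser under test while keeping everything else fixed is what makes the iteration counts comparable.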
I also included “fast read()” variants of the Validator.nu HTML Parser. These variants have the per-character error reporting and validator-precision source location tracking commented out. (I only commented them out in the most obvious place—there’s more potential for removing stuff that’s only interesting in a validator.)
Here are the results, as iterations per unit of time, relative to the tokenizer of the Validator.nu HTML Parser with the thin SAX layer. (Note that the two VMs are not equally fast: the 1.6 x86_64 server VM is over 50% faster than the 1.5 x86 client VM.)
Parser | 1.5 x86 Client | 1.6 x86_64 Server |
---|---|---|
Xerces-J 2.9.1 with Namespaces | 109% | 112% |
Xerces-J 2.9.1 without Namespaces | 141% | 142% |
Ælfred2 (Validator.nu fork) with Namespaces | 72% | 75% |
Validator.nu HTML Parser streaming SAX mode | 89% | 93% |
Validator.nu HTML tokenizer with thin SAX layer | 100% | 100% |
Validator.nu HTML Parser streaming SAX mode, fast read() | 95% | 93% |
Validator.nu HTML tokenizer with thin SAX layer, fast read() | 107% | 104% |
The difference between Xerces with and without Namespaces sure looks interesting. Let’s see what the numbers look like relative to Xerces without Namespaces (that is, with each figure divided by the Xerces-without-Namespaces figure).
Parser | 1.5 x86 Client | 1.6 x86_64 Server |
---|---|---|
Xerces-J 2.9.1 with Namespaces | 78% | 79% |
Xerces-J 2.9.1 without Namespaces | 100% | 100% |
Ælfred2 (Validator.nu fork) with Namespaces | 51% | 53% |
Validator.nu HTML Parser streaming SAX mode | 63% | 66% |
Validator.nu HTML tokenizer with thin SAX layer | 71% | 71% |
Validator.nu HTML Parser streaming SAX mode, fast read() | 68% | 66% |
Validator.nu HTML tokenizer with thin SAX layer, fast read() | 76% | 74% |
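As a quick sanity check on the rescaling, the thin tokenizer’s 100% in the first table, divided by the 141% of Xerces without Namespaces (1.5 client VM column), gives the 71% in the second table. (Other rows can be off by a percentage point from this back-of-the-envelope version, since the published figures come from the unrounded raw counts.)

```java
public class Rescale {
    // Re-express a percentage relative to a new 100% baseline.
    public static long rescale(double value, double newBaseline) {
        return Math.round(value / newBaseline * 100.0);
    }

    public static void main(String[] args) {
        // First table, 1.5 x86 client VM column: thin tokenizer 100%,
        // Xerces-J without Namespaces 141%.
        System.out.println(Rescale.rescale(100.0, 141.0) + "%"); // prints "71%"
    }
}
```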
Even when input is well-formed, the latent ability of the parser to deal with the HTML tree building rules has a cost. XHTML-as-text/html advocates take note: Making your markup well-formed doesn’t make an HTML parser go as fast as an XML parser with the same tokenizer implementation would go.
Even with Namespace support, Xerces beats the Validator.nu HTML Parser’s tokenizer (which is copying data even when not needed). In my defense, Xerces has more person years invested in it and doesn’t have validator-precision source location tracking.
Removing the per-character part of validator-precision reporting makes the Validator.nu tokenizer almost match Xerces with Namespaces on the 1.5 client VM, but the relative gain from removing reporting is smaller on the 1.6 server VM.
Namespace support in Xerces is relatively more costly than the HTML5 tree building rules in the Validator.nu HTML Parser. That’s right: on a page with a default namespace for elements declared at the top and attributes (except on the root element) in no namespace, enabling Namespaces lops off over 20% of the performance.
An XML parser (here the patched Ælfred2) can be a lot slower than an HTML parser. XML advocates take note: Just having an XML parser doesn’t guarantee performance over HTML—especially if the HTML side is getting more attention.
The performance cost of the HTML tree builder is rather small after all: 7% on the better VM.
Xerces is faster. Namespaces are worse than the much-maligned HTML “extra fix-ups” (21% hit vs. 7% hit). An XML parser can be slow.