In the spirit of documenting one’s mistakes…
Yesterday, I decided to tackle a long-standing presumed performance issue with the Validator.nu HTML Parser. The parser uses an intermediate buffer for element and attribute names, system and public identifiers, comments, the doctype name, attribute values and strings starting with the ampersand that might or might not be named character references. It doesn’t use the input buffer directly, because an input buffer boundary may occur inside one of the above-mentioned strings. Also, in the case of attribute values with character references in them, the character references need to be expanded which makes the sequence of characters differ from the characters in the input buffer.
Conventional wisdom says that copying data is slow and should be avoided. When the input buffer boundary doesn’t occur inside of value and when an attribute value doesn’t contain character references, the intermediate buffer is useless. Therefore, I decided to eliminate the use of the intermediate buffer when it’s not necessary.
After quite a bit of work—especially on refactoring how hyphens in comments are handled—I had a parser build that used the input buffer directly when possible and an intermediate buffer only when needed. This involved an if (offset != -1)
check on each buffer append operation to see if an intermediate buffer needs to be appended to are just an integer incremented. I thought that reading a field and doing a conditional branch would be cheaper than doing an unconditional memory write behind a pointer and offset.
I also did some additional performance tweaking before running benchmarks. (Mistake! Benchmarks should be run after each tweak to isolate the effect of each tweak.)
The benchmark results were depressing.
Parser | Iterations per time |
---|---|
JDK Xerces (parsing as XML—not HTML) | 130% |
Validator.nu HTML Parser 1.0.7 (method per tokenizer state, no foreign content support) | 100% |
Validator.nu HTML Parser 2008-05-02 (major tokenizer refactoring to switch ) | 107% |
Validator.nu HTML Parser 2008-08-11 (just before modification described here) | 110% |
Validator.nu HTML Parser 2008-08-12 (with the buffering approach described here) | 89% |
Validator.nu HTML Parser 2008-08-12 bis (commented out code to unconditionally use the intermediate buffer) | 113% |
The benchmark is parsing the Wikipedia front page from a predecoded UTF-16 buffer in memory in streaming SAX mode and serializing as XML to a mock Writer
that does nothing. Run on HotSpot Client VM 1.5.0_13 x86. Results are relative to Validator.nu HTML Parser 1.0.7. Larger number is better.
There is a chance that wrapping the buffers in a layer of object indirection and using different classes for the two buffering modes and letting dynamic dispatch choose the right append code path could be faster. But I don’t want to spend time trying that right now.
For attribute and element names, I could build support for discontiguous buffers, but I don’t want to spend time trying that right now, either.