I am working on a conformance checking service for (X)HTML5. The service is grammar-based for the most part with RELAX NG as the schema language. Some extra-grammatical constraints are expressed as Schematron assertions. Currently, as a Mozilla Foundation grantee, I am working on writing checkers (in Java) for spec features that cannot (practically or at all) be checked using RELAX NG or Schematron.
In a Web two-point-ohey perpetual beta fashion, I am deploying the new prototype features early to allow testing.
The first non-schema checker prototype is a table integrity checker. Since the table model for (X)HTML5 is still being specified, the prototype is speculatively based on the HTML 4.01 table model and browser behavior. The differences from HTML 4.01 are that colspan='0' is treated as colspan='1' and that headers must refer to th cells. The top left corner of each cell is placed in the first available slot on the row, which is browser-compatible but different from what the CSS2 spec says.
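To make the placement rule concrete, here is a minimal Java sketch of first-available-slot placement for a single row. It is not the checker's actual code: the class name, the method names and the boolean array that marks slots taken by cells spanning down from earlier rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of first-available-slot placement; not the checker's real code. */
public class SlotPlacement {

    /** colspan='0' (or anything below 1) is treated as colspan='1'. */
    public static int effectiveColspan(int colspan) {
        return colspan < 1 ? 1 : colspan;
    }

    /**
     * Places the cells of one row. occupied[c] is true when column c is already
     * taken by a cell spanning down from an earlier row. Returns the starting
     * column of each cell in source order.
     */
    public static List<Integer> placeRow(boolean[] occupied, int[] colspans) {
        List<Integer> starts = new ArrayList<>();
        int next = 0;
        for (int colspan : colspans) {
            // Skip slots already taken by rowspanning cells from earlier rows.
            while (next < occupied.length && occupied[next]) {
                next++;
            }
            starts.add(next);
            // The cell's width is not checked here; running into an occupied
            // slot would be reported separately as an overlap.
            next += effectiveColspan(colspan);
        }
        return starts;
    }

    public static void main(String[] args) {
        boolean[] occupied = {true, false, false, true, false};
        // First cell has colspan 2, second has colspan 0 (treated as 1).
        System.out.println(placeRow(occupied, new int[] {2, 0})); // prints [1, 4]
    }
}
```

In this example the first cell skips the occupied column 0 and starts at column 1; the second cell skips the occupied column 3 and starts at column 4.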
The checker emits both warnings and errors. Depending on how the spec turns out, errors may become warnings or vice versa.
Currently, the errors are:
Table cell is overlapped by later table cell.
Table cell overlaps an earlier table cell. (Single overlap gets reported in both directions to show source location for both cells.)
Table cell spans past the end of its row group.
Row has no cells starting on it.
Table row column count is greater than the column count established by cols/colgroups.
Table row column count is less than the column count established by cols/colgroups.
The headers attribute doesn’t point to th elements in the same table.
Column has no cells starting on it. (Contiguous column ranges established by a single element are coalesced to a single error to protect against denial of service attacks.)
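The two overlap errors at the top of the list can be illustrated with a small Java sketch. It assumes cells have already been placed into slots and simply compares every cell against the earlier ones; the Cell class, its fields and the quadratic comparison are illustrative assumptions, not a description of how the real checker is built.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of two-way overlap reporting; not the checker's real code. */
public class OverlapSketch {

    /** A placed cell: slot position, extent and source line (illustrative names). */
    public static final class Cell {
        final int row, col, rowspan, colspan, line;

        public Cell(int row, int col, int rowspan, int colspan, int line) {
            this.row = row;
            this.col = col;
            this.rowspan = rowspan;
            this.colspan = colspan;
            this.line = line;
        }

        /** True when the two rectangles of slots share at least one slot. */
        boolean intersects(Cell other) {
            return row < other.row + other.rowspan && other.row < row + rowspan
                    && col < other.col + other.colspan && other.col < col + colspan;
        }
    }

    /** Reports each overlap twice so that both source locations are shown. */
    public static List<String> check(List<Cell> cellsInSourceOrder) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < cellsInSourceOrder.size(); i++) {
            Cell later = cellsInSourceOrder.get(i);
            for (int j = 0; j < i; j++) {
                Cell earlier = cellsInSourceOrder.get(j);
                if (later.intersects(earlier)) {
                    errors.add("line " + earlier.line
                            + ": Table cell is overlapped by later table cell.");
                    errors.add("line " + later.line
                            + ": Table cell overlaps an earlier table cell.");
                }
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell(0, 0, 1, 1, 2)); // first row, first cell
        cells.add(new Cell(0, 1, 2, 1, 3)); // first row, second cell, rowspan 2
        cells.add(new Cell(1, 0, 1, 2, 6)); // second row, colspan 2, runs into the cell above
        check(cells).forEach(System.out::println);
    }
}
```

The sample table produces one error for the cell on line 3 and one for the cell on line 6, which matches the "reported in both directions" behavior described above.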
Currently, the warnings are:
colspan exceeds 1000, which is a magic number in Gecko (and according to comments in Gecko source, in IE and Opera, too)
rowspan exceeds 8190, which is a magic number in Gecko
Table row column count is greater than the column count established by the first row in the absence of cols/colgroups.
Table row column count is less than the column count established by the first row in the absence of cols/colgroups.
A col element causes a span attribute to be ignored on the parent colgroup. (Conforming in HTML 4 / XHTML 1.0; non-conforming in (X)HTML5. With (X)HTML5 there’s also a schema-level error.)
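The split between errors and warnings for column count mismatches, and the span limits, could be sketched roughly as follows in Java. The class, enum and method names are made up for this illustration; only the thresholds 1000 and 8190 and the error/warning split come from the lists above.

```java
import java.util.Optional;

/** Hypothetical sketch of severity selection; not the checker's real code. */
public class ColumnCountSketch {

    public enum Severity { ERROR, WARNING }

    // Magic numbers taken from the warning list above.
    public static final int MAX_COLSPAN = 1000;
    public static final int MAX_ROWSPAN = 8190;

    /**
     * Returns the severity of a column count mismatch, or empty when the counts
     * match. A mismatch is an error when cols/colgroups established the expected
     * count and only a warning when the first row established it.
     */
    public static Optional<Severity> classify(
            int rowColumns, int expectedColumns, boolean hasColsOrColgroups) {
        if (rowColumns == expectedColumns) {
            return Optional.empty();
        }
        return Optional.of(hasColsOrColgroups ? Severity.ERROR : Severity.WARNING);
    }

    /** True when either span exceeds the limit that triggers a warning. */
    public static boolean exceedsSpanLimits(int colspan, int rowspan) {
        return colspan > MAX_COLSPAN || rowspan > MAX_ROWSPAN;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 4, true));   // prints Optional[ERROR]
        System.out.println(classify(3, 4, false));  // prints Optional[WARNING]
        System.out.println(exceedsSpanLimits(1001, 1)); // prints true
    }
}
```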
The table integrity checker only sees a projection of the document tree that contains nothing but table-significant elements. Crazy subtrees of table-significant elements in wrong places are silently pruned; these are dealt with on the RELAX NG level. The table integrity checker assumes that it is being used together with a reasonable schema.
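One way to picture such a projection is a SAX filter that forwards only table-related elements. The Java sketch below is a simplified assumption of mine rather than the service's implementation: the set of "table-significant" element names is a guess, namespaces are ignored, and the pruning of misplaced subtrees is left out entirely.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch of a table-only projection; not the checker's real code. */
public class TableProjection extends XMLFilterImpl {

    // Assumed set of table-significant element names.
    static final Set<String> TABLE_SIGNIFICANT = new HashSet<>(Arrays.asList(
            "table", "caption", "colgroup", "col",
            "thead", "tbody", "tfoot", "tr", "td", "th"));

    public TableProjection(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (TABLE_SIGNIFICANT.contains(localName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (TABLE_SIGNIFICANT.contains(localName)) {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text content is not table-significant, so it is dropped.
    }

    /** Parses the XML string and returns the element names that survive the filter. */
    public static List<String> project(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            TableProjection filter =
                    new TableProjection(factory.newSAXParser().getXMLReader());
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes atts) {
                    seen.add(localName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // prints [table, tr, td]
        System.out.println(project("<table><tr><td><p>x</p></td></tr></table>"));
    }
}
```

A downstream handler behind this filter never sees the p element or the text inside it, which is the sense in which the checker works on a projection of the tree.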
The table integrity checker is also enabled for the HTML 4.01 / XHTML 1.0 presets on the generic side of the service, so testing with today’s content is possible.
There’s a pseudo-schema called http://hsivonen.iki.fi/checkers/table/ which isn’t a schema but a magic URL that causes the system to instantiate the table integrity checker. There’s a pseudo-pseudo-schema called http://hsivonen.iki.fi/checkers/all/ which expands to all pseudo-schemas, but at the moment, there’s only one.
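The magic-URL expansion described here might look something like the following Java sketch. Only the two checker URLs come from this post; the class name, the method and the example.org schema URL are hypothetical, and how the real service wires checkers together is not shown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of pseudo-schema expansion; not the service's real code. */
public class PseudoSchemas {

    public static final String TABLE = "http://hsivonen.iki.fi/checkers/table/";
    public static final String ALL = "http://hsivonen.iki.fi/checkers/all/";

    // At the moment there is only one pseudo-schema.
    private static final List<String> ALL_CHECKERS = Collections.singletonList(TABLE);

    /**
     * Picks out the checker URLs from a list of requested schema URLs. The "all"
     * URL expands to every known pseudo-schema; real schema URLs are skipped.
     */
    public static Set<String> checkersFor(List<String> schemaUrls) {
        Set<String> checkers = new LinkedHashSet<>();
        for (String url : schemaUrls) {
            if (ALL.equals(url)) {
                checkers.addAll(ALL_CHECKERS);
            } else if (ALL_CHECKERS.contains(url)) {
                checkers.add(url);
            }
        }
        return checkers;
    }

    public static void main(String[] args) {
        // The example.org URL stands in for a real schema and is made up.
        System.out.println(checkersFor(Arrays.asList(
                "http://example.org/some-schema.rnc", ALL)));
    }
}
```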
Please let me know if the table integrity checker does not work as advertised.
Cross-posted to the WHAT WG blog. Comments enabled there.