HTML Syntax Checker in PHP

HTMLSyntaxChecker is a PHP class for checking the syntax of HTML input. It is a linter, not a validator in the SGML sense. The design goal is that any input HTMLSyntaxChecker accepts is also valid HTML 4.01; the converse is not necessarily true. The idea is to restrict the input in order to rule out common mistakes and to make further processing of the markup easier (for example, by requiring end tags).

Differences from HTML 4.01 Strict

HTMLSyntaxChecker is designed to be used as a story input checker in an administrative front end of a database-driven PHP-based site. It is being deployed at Macsanomat (a Macintosh news site). The limitations are based on the needs of Macsanomat, but I think they make sense for other similar sites, too. Of course, the source can be customized as needed.

If an end tag is possible, it is required.
The only supported attribute value delimiter is the double quote ("). The single quote (') is not supported.
Forms are not allowed.
Scripts are not allowed.
The forced line break (<br>) is allowed only in headings, addresses and inside code samples.
Images aren’t allowed inside code samples.
The id attribute isn’t allowed in most cases.
The style attribute is forbidden; it is <font> in disguise.
<pre> is forbidden.

Known limitations

In tables only <table>, <tr>, <th> and <td> are supported.
Productions of the form (foo)+ are not supported properly. They are treated like (foo)*.
Checking for #REQUIRED attributes is still missing. There is only a temporary test for the alt attribute of <img>.
HTMLSyntaxChecker does not check the data types of attributes. (SGML validators don’t check them, either.)
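
To illustrate the (foo)+ limitation: the HTML 4.01 DTD declares the content of <ul> as (LI)+, so a list needs at least one item. The sketch below is not the checker's own code. It models a sequence of child element names as a space-separated string and compares the two quantifiers; the function name matchesContentModel and the string encoding are assumptions made for the example.

```php
<?php
// Hypothetical model: child element names joined by single spaces, e.g. "li li".
// This illustrates + versus * and is not HTMLSyntaxChecker's implementation.
function matchesContentModel(string $children, string $quantifier): bool
{
    // (li )+ requires one or more items; (li )* also accepts none.
    $pattern = '/^(?:li )' . $quantifier . '$/';
    $subject = $children === '' ? '' : $children . ' ';
    return preg_match($pattern, $subject) === 1;
}

var_dump(matchesContentModel('li li', '+')); // bool(true)
var_dump(matchesContentModel('', '+'));      // bool(false): what the DTD requires
var_dump(matchesContentModel('', '*'));      // bool(true): the behavior described above
```

Under this reading, an element with an empty child sequence, such as <ul></ul>, would presumably pass the checker even though HTML 4.01 rejects it.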

Things to Modify

The syntax checker supports the full range of HTML 4.01 character entity references. It also supports both decimal and hexadecimal numeric character references (NCRs). However, at Macsanomat we decided to remove the support for entities (except &lt;, &gt;, &quot; and &amp;) and for hexadecimal NCRs in order to reduce the number of ways a particular character may be represented in the database.
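
A restriction like the one adopted at Macsanomat can be pictured as a pre-check on character references. The following is a minimal sketch, not code from HTMLSyntaxChecker: the function name findDisallowedReferences and the regular expression are assumptions, and it only looks at text that resembles a reference (an ampersand, a name or number, and a semicolon).

```php
<?php
// Hypothetical helper, not part of HTMLSyntaxChecker.
// Returns every character reference that is neither one of the four
// permitted entities nor a decimal NCR.
function findDisallowedReferences(string $html): array
{
    $allowedEntities = ['&lt;', '&gt;', '&quot;', '&amp;'];

    // Match named entities, decimal NCRs and hexadecimal NCRs.
    preg_match_all(
        '/&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/',
        $html,
        $matches
    );

    $disallowed = [];
    foreach ($matches[0] as $reference) {
        $isDecimalNcr = preg_match('/^&#[0-9]+;$/', $reference) === 1;
        if (!$isDecimalNcr && !in_array($reference, $allowedEntities, true)) {
            $disallowed[] = $reference;
        }
    }
    return $disallowed;
}

// Reports &eacute; and &#xE9; but not &amp; or &#233;.
print_r(findDisallowedReferences('<p>caf&eacute; &amp; &#233; &#xE9;</p>'));
```

With such a rule in place, é can reach the database only as the literal character or as &#233;, which is the kind of reduction in representations described above.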

Also, it may be useful to add custom elements for internal use. Custom elements may be easier to manage internally and can be replaced with more complex HTML when the content is served out.
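
As an illustration of that idea, a custom element could be expanded with a simple replacement when the content is served. The element name <pullquote> and the replacement markup below are invented for the example. The sketch assumes the stored markup has already passed the checker, so every start tag has a matching end tag and element names are in lower case; it also assumes the element takes no attributes and is never nested.

```php
<?php
// Hypothetical custom element <pullquote>; not defined by HTMLSyntaxChecker.
// Assumes checked input: lower-case names, explicit end tags,
// no attributes on the element and no nesting.
function expandPullquotes(string $html): string
{
    $expanded = preg_replace(
        '~<pullquote>(.*?)</pullquote>~s',
        '<div class="pullquote"><p>$1</p></div>',
        $html
    );
    // preg_replace returns null on a regex error; fall back to the input.
    return $expanded ?? $html;
}

echo expandPullquotes('<pullquote>Short and quotable.</pullquote>');
// <div class="pullquote"><p>Short and quotable.</p></div>
```

Because the checker requires end tags, a plain pattern match is enough here; unchecked tag soup would need a real parser.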

Usage

The source (available with syntax coloring) consists of a single class definition. Once the source has been included in the program, instances of HTMLSyntaxChecker may be constructed as follows:
$checker = new HTMLSyntaxChecker($HTMLstring, $rootModel);

Here $HTMLstring is a string containing the HTML input and $rootModel describes the content model of the implicit element in which the HTML input is considered to exist. The permitted values are "BLOCK", "INLINE", "FLOW" and "TEXT". These correspond to %Block;, %Inline;, %Flow; and %Text; of the HTML 4.01 Strict DTD. For example, if the root content model is set to "BLOCK", the contents of $HTMLstring could appear in an HTML 4.01 Strict document between <body> and </body>.

Then the object can be used like this:
if ($checker->succeeded()) {
    $caseNormalizedHTMLstring = $checker->getResult();
} else {
    echo $checker->getError();
}

The checker has three public methods:

succeeded(): Returns true if the syntax of the input was OK and false otherwise.
getResult(): Returns a case-normalized version of the input string. All the element and attribute names are folded to lower case.
getError(): Returns an error message as an HTML snippet that can be echoed out in any context where block content is allowed.
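
The three methods can be wrapped in a small helper so that calling code handles a single return value. This is a sketch under stated assumptions: the function name checkStory and the array shape are invented here, and a stand-in object with a made-up error message replaces the real class so the example runs on its own. Only succeeded(), getResult() and getError() come from HTMLSyntaxChecker.

```php
<?php
// Hypothetical wrapper; accepts any object offering the three public methods.
function checkStory(object $checker): array
{
    if ($checker->succeeded()) {
        return ['ok' => true, 'html' => $checker->getResult(), 'error' => null];
    }
    return ['ok' => false, 'html' => null, 'error' => $checker->getError()];
}

// Stand-in for a failed check; the error text is invented for the example.
$stub = new class {
    public function succeeded() { return false; }
    public function getResult() { return ''; }
    public function getError() { return '<p>Unexpected end tag.</p>'; }
};

$outcome = checkStory($stub);
echo $outcome['ok'] ? $outcome['html'] : $outcome['error'];
```

With the real class loaded, the call would be checkStory(new HTMLSyntaxChecker($HTMLstring, "BLOCK")).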