HTMLSyntaxChecker is a PHP class for checking the syntax of HTML input. HTMLSyntaxChecker is a linter. It is not a validator in the SGML sense. The design goal is that the input HTMLSyntaxChecker accepts is also valid HTML 4.01. However, the opposite is not necessarily true. The idea is to restrict the input in order to disallow common mistakes and to make further processing of the markup easier (by requiring end tags for example).
HTMLSyntaxChecker is designed to be used as a story input checker in an administrative front end of a database-driven PHP-based site. It is being deployed at Macsanomat (a Macintosh news site). The limitations are based on the needs of Macsanomat, but I think they make sense for other similar sites, too. Of course, the source can be customized as needed.
- <br> is allowed only in headings, addresses and inside code samples.
- The id attribute isn’t allowed in most cases.
- The style attribute is forbidden. The style attribute is <font> in disguise.
- <pre> is forbidden.
- <table>, <tr>, <th> and <td> are supported.
- Content models of the form (foo)+ are not supported properly. They are treated like (foo)*.
- Checking for #REQUIRED attributes is still missing. There is only a temporary test for the alt attribute of <img>.
The syntax checker supports the full range of HTML 4.01 character reference entities. It also supports both decimal and hexadecimal NCRs. However, at Macsanomat we decided to remove the support for entities (except &lt;, &gt;, &quot; and &amp;) and for hexadecimal NCRs in order to reduce the number of ways a particular character may be represented in the database.
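
The restricted character reference policy described above can be illustrated with a small stand-alone check. This is only a sketch and not code from HTMLSyntaxChecker: the function name findDisallowedReferences is hypothetical, and the typed signature assumes PHP 7 or later. It reports every named entity other than lt, gt, quot and amp, and every hexadecimal NCR, while letting decimal NCRs through.

```php
<?php
// Hypothetical helper, not part of HTMLSyntaxChecker: lists the character
// references that the restricted policy described above would reject.
function findDisallowedReferences(string $html): array
{
    $allowedEntities = ['lt', 'gt', 'quot', 'amp'];
    $disallowed = [];

    // Match hexadecimal NCRs (&#x7B;), decimal NCRs (&#123;) and named entities (&name;).
    preg_match_all(
        '/&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/',
        $html,
        $matches,
        PREG_SET_ORDER
    );

    foreach ($matches as $match) {
        $body = $match[1];
        if ($body[0] === '#') {
            // Numeric reference: only the hexadecimal form is rejected.
            if ($body[1] === 'x' || $body[1] === 'X') {
                $disallowed[] = $match[0];
            }
        } elseif (!in_array($body, $allowedEntities, true)) {
            // Named entity outside the four that remain allowed.
            $disallowed[] = $match[0];
        }
    }

    return $disallowed;
}

// Reports &eacute; and &#xE9; but accepts &amp; and the decimal &#233;.
print_r(findDisallowedReferences('Caf&eacute; &amp; &#233; &#xE9;'));
```

With a policy like this, a given character has one numeric spelling in the database, so stored stories can be compared and searched without first normalizing references.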
Also, it may be useful to add custom elements for internal use. Custom elements may be easier to manage internally and can be replaced with more complex HTML when the content is served out.
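
As a rough illustration of the custom element idea, the sketch below expands a made-up element at serving time. The element name keycap, the function expandCustomElements and the replacement markup are all assumptions chosen for the example; none of them comes from HTMLSyntaxChecker or Macsanomat. The sketch also assumes that the stored markup has already passed the checker, so tag names have a single, normalized spelling and the element carries no attributes.

```php
<?php
// Hypothetical custom element: <keycap> is not HTML 4.01 and is not known to
// be used by HTMLSyntaxChecker or Macsanomat. It only illustrates the idea.
function expandCustomElements(string $storedMarkup): string
{
    // Map each attribute-free custom tag to the more complex HTML served out.
    $replacements = [
        '<keycap>'  => '<span class="keycap"><kbd>',
        '</keycap>' => '</kbd></span>',
    ];

    // A plain string map is enough here because the stored markup is assumed
    // to be already checked and case-normalized.
    return strtr($storedMarkup, $replacements);
}

echo expandCustomElements('<p>Press <keycap>Esc</keycap> to cancel.</p>');
```

Keeping the short custom form in the database means the served HTML can later change in one place without touching the stored stories.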
The source (available with syntax coloring) consists of a single class definition. Once the source has been included in the program, instances of HTMLSyntaxChecker may be constructed as follows:
$checker = new HTMLSyntaxChecker($HTMLstring, $rootModel);
Here $HTMLstring is a string containing the HTML input and $rootModel describes the content model of the implicit element in which the HTML input is considered to exist. The permitted values are "BLOCK", "INLINE", "FLOW" and "TEXT". These correspond to %Block;, %Inline;, %Flow; and %Text; of the HTML 4.01 Strict DTD. For example, if the root content model is set to "BLOCK", the contents of $HTMLstring could appear in an HTML 4.01 Strict document between <body> and </body>.
Then the object can be used like this:
if ($checker->succeeded()) {
    $caseNormalizedHTMLstring = $checker->getResult();
} else {
    echo $checker->getError();
}
The checker has three public methods:
succeeded()
    Returns true if the syntax of the input was OK and false otherwise.
getResult()
    Returns the case-normalized HTML string.
getError()
    Returns the error message.
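
The three methods can be wrapped into one helper that an administrative front end might call before saving a story. The following is a minimal sketch under stated assumptions: summarizeCheck is a hypothetical name, and StubChecker is a stand-in that merely exposes the same three public methods so the example runs without the real class. Its "check" only tests for non-empty input and performs no syntax checking at all.

```php
<?php
// Stand-in exposing the same three public methods as HTMLSyntaxChecker.
// It exists only so that this sketch runs on its own; its check is a
// placeholder (non-empty input) and not real syntax checking.
class StubChecker
{
    private $input;

    public function __construct(string $input)
    {
        $this->input = $input;
    }

    public function succeeded(): bool
    {
        return trim($this->input) !== '';
    }

    public function getResult(): string
    {
        return $this->input;
    }

    public function getError(): string
    {
        return 'Empty input';
    }
}

// Hypothetical helper: collapses the checker's state into a single array,
// reading getResult() only on success and getError() only on failure.
function summarizeCheck($checker): array
{
    if ($checker->succeeded()) {
        return ['ok' => true, 'html' => $checker->getResult(), 'error' => null];
    }
    return ['ok' => false, 'html' => null, 'error' => $checker->getError()];
}

$outcome = summarizeCheck(new StubChecker('<p>Hello</p>'));
echo $outcome['ok'] ? $outcome['html'] : $outcome['error'];
```

In real use, the stub would be replaced by an actual instance, for example summarizeCheck(new HTMLSyntaxChecker($HTMLstring, "BLOCK")).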