I am just writing this down so I don’t forget it. There are no immediate implementation plans. There are no implementation promises, either. There especially are no hosting promises at this time. However, I have cross-posted this to the WHATWG wiki for comments.
First, I assume there is some level of interest in doing RELAX NG / Schematron validation and HTML5 conformance checking. Next, it would be nice to enable applications that deal with documents to make these checks automatically in addition to having the functionality available for human operators as a Web app. For example, a content management system might check the input it is given.
Java apps could just integrate a private copy of the Free Software back end of the validation / conformance checking service. However, non-Java apps would benefit from having the validation / conformance checking service running out of process and having an interface for talking to the out-of-process Java service. The service instance could be hosted publicly or as a local copy. Even some Java developers would elect to use such a service instead of integrating the back end as part of their own app.
The schemas are expected to be relatively static. Therefore, I think preloading them into the service or letting the service retrieve them is sufficient. Identification by URI works in both cases.
What needs different input modes is the document that is checked.
I think the following modes would make sense:
Document URI as a GET parameter; the service retrieves the document by URI (already implemented).
Document in a data: URI as a GET parameter.
Document POSTed as the HTTP entity body (the preferred Web service mode).
Document POSTed as an application/x-www-form-urlencoded form field value.
Document POSTed as a multipart/form-data file upload.
In the first three modes, additional parameters would be communicated in the URI query string. In the last two modes, additional parameters would be communicated the same way as the corresponding form fields are communicated in application/x-www-form-urlencoded and multipart/form-data.
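As a rough illustration of the first and third modes, a client could build its requests along these lines. This is only a sketch: the endpoint URI and the parameter names doc and schema are hypothetical placeholders, not names the service defines.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; the real service URI is not specified here.
ENDPOINT = "http://validator.example/check"


def get_by_uri_request(doc_uri, **params):
    """Mode 1: the service retrieves the document from doc_uri itself."""
    # "doc" is an assumed parameter name.
    query = urlencode({"doc": doc_uri, **params})
    return Request(ENDPOINT + "?" + query, method="GET")


def post_body_request(document, content_type, **params):
    """Mode 3: the document is the HTTP entity body.

    Additional parameters travel in the query string, and the
    Content-Type header lets the service pick a parser.
    """
    url = ENDPOINT + ("?" + urlencode(params) if params else "")
    return Request(url, data=document, method="POST",
                   headers={"Content-Type": content_type})
```

Passing either request object to urllib.request.urlopen would send it; nothing touches the network until then.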
I don’t particularly like the last two modes, but they are needed to address feature requests and for parity with other services. Also, unlike the first three modes, the last two modes need companion UI changes, which is not nice. As a further complication, the last two don’t come naturally with a Content-Type for dispatching to an HTML5 parser or to an XML parser.
All these input modes would share the same “service endpoint URI” (and the same servlet class). The different cases can be distinguished by the HTTP method and, in the POST cases, by the Content-Type request header.
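The dispatch described above could look roughly like this. It is a sketch under assumptions, not the servlet’s actual logic: the doc parameter name and the returned mode labels are made up for illustration.

```python
def input_mode(method, content_type=None, query=None):
    """Return a label for the input mode of a request.

    method is the HTTP method, content_type is the raw Content-Type
    header value (or None) and query is a dict of decoded query-string
    parameters. The "doc" name and the labels are hypothetical.
    """
    query = query or {}
    method = method.upper()
    if method == "GET":
        doc = query.get("doc", "")
        # URI schemes are case-insensitive.
        if doc[:5].lower() == "data:":
            return "data-uri"
        return "fetch-by-uri"
    if method == "POST":
        # Drop parameters such as "; charset=utf-8" and normalize case.
        media_type = (content_type or "").split(";", 1)[0].strip().lower()
        if media_type == "application/x-www-form-urlencoded":
            return "form-field"
        if media_type == "multipart/form-data":
            return "file-upload"
        # Any other media type: the entity body is the document itself.
        return "entity-body"
    raise ValueError("unsupported HTTP method: " + method)
```

In the entity-body case the same Content-Type value can then be used to choose between the HTML5 parser and an XML parser.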
A Web service probably calls for an XML output format for maximal tool chain integration even though the current HTML output format makes sense for browsers and can carry all the necessary data.
I think the following output formats would make sense:
HTML with microformat-style class annotations (already implemented except the annotation granularity could be better).
XHTML with microformat-style class annotations.
A custom XML format that is super-simple and uses element names for easier processing with tools that are biased towards keying on element name rather than on attribute value.
For the HTML and XHTML output formats, there could be an option for suppressing the input form. The output default should be HTML for the browser-targeted input formats. However, the custom XML format might be a reasonable default when the input document was POSTed as the entity body.
The elements in this XML vocabulary are in the namespace “http://hsivonen.iki.fi/validator/messages/”. The attributes in this XML vocabulary are not in a namespace. The attribute values defined for this XML vocabulary must not have preceding or trailing white space.
Note: The format has been designed to support streaming generation and consumption.
The format consists of an XML 1.0 document that has the element messages as the root element.
The root element may have zero or more child elements named info, warning and error. The element info means an informational message. The element warning signifies a potential problem that does not cause the validation/checking to fail. The element error signifies a problem that causes the validation/checking to fail. The character data content of these three elements may contain a human-readable message. (Entity-escaped HTML is not allowed. :-)
The elements info, warning and error have three optional attributes for indicating the context of the message: uri, line and column.
The column attribute must not be present unless the line attribute is present as well.
The uri attribute, if present, must contain the URI (not IRI) of the HTTP resource with which the message is associated or the literal string “data:…” (the last character is U+2026) to signify that the message is associated with a data URI resource but the exact URI has been omitted. (If a client application wishes to show IRIs to human users, it is up to the client application to convert the URI into an IRI.)
The line attribute, if present, must contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which when interpreted as a base-ten integer is a positive integer (not zero). This number means the approximate source text line number associated with the message. The first line is 1.
The column attribute, if present, must contain a string consisting of characters in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE which when interpreted as a base-ten integer is a positive integer (not zero). This number means the approximate source column number associated with the message on the line indicated by the line attribute. The first character on a line is in column 1.
The source lines and columns are approximate. For example, if a message is related to an attribute, the line and column may point to the first character of the start tag, the character after the start tag or the attribute inside the tag, depending on the implementation. If a message is related to character data, the line and column may be inaccurate within a run of text, e.g. due to buffering. Furthermore, implementations may count column numbers in terms of UTF-16 code units instead of characters.
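To make the last point concrete, here is a small sketch of how the two ways of counting diverge once a line contains a character outside the Basic Multilingual Plane. The function names are made up, and “characters” is taken to mean Unicode code points.

```python
def column_in_code_points(line_text, index):
    """1-based column of line_text[index], counting code points."""
    return index + 1


def column_in_utf16_units(line_text, index):
    """1-based column of line_text[index], counting UTF-16 code units."""
    prefix = line_text[:index]
    # Each UTF-16 code unit is two bytes; astral characters take two units.
    return len(prefix.encode("utf-16-le")) // 2 + 1


# U+1D11E MUSICAL SYMBOL G CLEF needs a surrogate pair in UTF-16.
line = "\U0001D11E<p>"
print(column_in_code_points(line, 1), column_in_utf16_units(line, 1))
```

For the "<" in this line, the first function reports column 2 and the second reports column 3, so a client should not assume the two agree.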
The error element may have an attribute called type for indicating that an error is not a general error. Permissible values for the type attribute are: fatal (signifies a well-formedness violation or another error after which no more checking was performed), io (signifies an input/output error), schema (indicates that initializing a schema-based validator failed) and internal (indicates that the validator/checker found a bug in itself, ran out of memory, etc., but was still able to emit a message).
The validation/checking is considered to have failed if there is at least one error element.
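Since the format is meant to allow streaming generation, a producer can write each message as soon as it is known. The following is a minimal sketch of such a writer, not the service’s actual code; the class and method names are invented for illustration.

```python
import io
from xml.sax.saxutils import escape, quoteattr

NS = "http://hsivonen.iki.fi/validator/messages/"
ERROR_TYPES = ("fatal", "io", "schema", "internal")


class MessageWriter:
    """Writes a messages document to a text stream, one message at a time."""

    def __init__(self, out):
        self.out = out
        out.write("<messages xmlns=%s>" % quoteattr(NS))

    def message(self, kind, text, uri=None, line=None, column=None,
                error_type=None):
        if kind not in ("info", "warning", "error"):
            raise ValueError("unknown message kind: %r" % kind)
        if column is not None and line is None:
            raise ValueError("column must not be present without line")
        for number in (line, column):
            if number is not None and (not isinstance(number, int)
                                       or number < 1):
                raise ValueError("line and column must be positive integers")
        if error_type is not None:
            if kind != "error" or error_type not in ERROR_TYPES:
                raise ValueError("bad error type: %r" % error_type)
        attrs = (("uri", uri), ("line", line), ("column", column),
                 ("type", error_type))
        rendered = "".join(" %s=%s" % (name, quoteattr(str(value)))
                           for name, value in attrs if value is not None)
        # escape() keeps the message as plain character data.
        self.out.write("<%s%s>%s</%s>" % (kind, rendered, escape(text), kind))

    def close(self):
        self.out.write("</messages>")


# Usage with an in-memory stream; the message texts are made up.
buf = io.StringIO()
writer = MessageWriter(buf)
writer.message("info", "Checking started.")
writer.message("error", "Unexpected element.",
               uri="http://example.org/", line=3, column=7)
writer.close()
```

Because nothing about the document depends on the total number of messages, the writer never has to buffer more than one message.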
Clients that consume the message format are referred to as processors. They must use a conforming XML 1.0 processor to parse the format.
If the root element is not an element named messages, the document is deemed to be in an unknown format and not processable according to this processing model.
If a processor encounters an element that it doesn’t recognize, it must process the content of the element as if the start tag and the end tag of the element were not there. If the processor encounters character data as a child of the root element (after applying the rule stated in the previous sentence), it must act as if the character data was not there. If a processor encounters an attribute that it does not recognize, it must ignore the entire attribute. If a processor encounters an attribute that it does recognize but the value of the attribute is not permissible under the previous section, the processor must ignore the entire attribute. If an info, warning or error element does not have a line attribute with a permissible value, a column attribute on the element must be ignored if present.
Note: These rules make it possible to add markup for source code dumps, document outlines and parse trees later without breaking clients. They also make it possible to introduce e.g. XHTML markup in the human-readable messages.
Processors must process elements in a way that is consistent with the semantics of the elements.
To determine whether the validation/checking succeeded, processors must determine whether the root element has no error element children. If there are no error children, the validation/checking succeeded. Otherwise, it failed.
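The processing rules above can be sketched as a small consumer. This is an illustration under assumptions, not a reference processor: the function names and the dict layout are invented, it parses the whole document at once instead of streaming, and it does not check whether a uri value is a valid URI.

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://hsivonen.iki.fi/validator/messages/}"
KINDS = {NS + "info": "info", NS + "warning": "warning", NS + "error": "error"}
ERROR_TYPES = {"fatal", "io", "schema", "internal"}
_DIGITS = re.compile(r"[0-9]+")


def _position(value):
    """Return value as a positive int, or None if it is not permissible."""
    if value is None or not _DIGITS.fullmatch(value) or int(value) == 0:
        return None
    return int(value)


def _messages(parent):
    """Yield message elements, looking through unrecognized wrappers."""
    for child in parent:
        if child.tag in KINDS:
            yield child
        else:
            # Unknown element: act as if its start and end tags were absent.
            yield from _messages(child)


def read_messages(xml_bytes):
    """Return (succeeded, messages) for a serialized messages document."""
    root = ET.fromstring(xml_bytes)
    if root.tag != NS + "messages":
        raise ValueError("unknown format")
    result = []
    # Character data directly under the root is never looked at.
    for el in _messages(root):
        kind = KINDS[el.tag]
        line = _position(el.get("line"))
        # A column without a permissible line is ignored.
        column = _position(el.get("column")) if line is not None else None
        error_type = el.get("type") if kind == "error" else None
        if error_type not in ERROR_TYPES:
            error_type = None
        result.append({
            "kind": kind,
            # itertext() flattens unrecognized markup inside the message.
            "text": "".join(el.itertext()),
            "uri": el.get("uri"),
            "line": line,
            "column": column,
            "type": error_type,
        })
    succeeded = not any(m["kind"] == "error" for m in result)
    return succeeded, result
```

A client that only cares about pass/fail can ignore the second return value; a streaming client would apply the same rules with an incremental parser.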
The W3C has defined two XML output formats for the W3C Validator: the SOAP format and the Unicorn format. I think there are two problems with these formats: they are unnecessarily complex and they don’t support streaming output. For example, they require a redundant declaration of the number of errors before the errors themselves (which a client could count on its own if it wants to know the number).
The W3C Validator also provides simple pass/fail information as HTTP headers, which is nice if you only care about a boolean pass/fail. However, this approach also has the problem that it precludes streaming, because the validation process has to finish before the HTTP headers can be written.
For these reasons, I am not particularly keen on reusing the output formats of the W3C Validator unless it turns out that there are significant network benefits to be reaped from plugging into an existing network of client software. It seems to me that there isn’t a significant network of existing client software.