Regular Expressions, Computer Science and Practice

“What does all this mean to you, as a user? Absolutely nothing. As a user, you don't care if it's regular, nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about.”
– Jeffrey Friedl, author of Mastering Regular Expressions, under the heading “NFA: Theory vs. Reality”

Theory—Education

My computer science curriculum included a course on compilers. We were taught that you can turn a regular expression into an NFA using Thompson’s Construction. We were also taught that you can convert any NFA into a DFA that you can actually run. It all made sense.

I knew that Perlish “regular expressions” aren’t really regular expressions in the computer science sense, but I thought that the implementors knew what they were doing and the extensions somehow fit nicely into the model of compiling into a DFA by using an NFA as an intermediate step. I sincerely thought that the explicit pattern compilation step in Java and Python built a reusable DFA representation that’d guarantee constant-space and O(n) time (where n is the length of the string being matched) when actually running the matcher.

I was so naïve.

Reality—Stack Trace

Validator.nu uses the Xerces regular expression implementation when XSD regular expressions are used in a RELAX NG schema. Validator.nu also sends me email when the validation engine crashes. (Thanks to a managed runtime, this doesn’t crash the app or even make the thread die.)

I got email that indicated that someone used the CMLComp Validator to check a CML file using Validator.nu as the back end and Validator.nu experienced a stack overflow:

java.lang.StackOverflowError
        at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)
        at org.apache.xerces.impl.xpath.regex.RegularExpression.matchString(Unknown Source)

And the same line just repeats over and over until the log4j emailer cutoff.

This clearly shows that the regular expression matching stage is recursive. Moreover, the depth of recursion depends on the CML document containing the string being matched and not the schema containing the regular expression. Otherwise, the stack would overflow all the time.

This doesn’t look like DFA constant-space and O(n) time behavior. Hmm…

Explanation—Paper

Via Tim Bray, I stumbled upon a paper that explains this all. The popular regular expression implementations that have capturing groups and, more importantly, back-references to these groups use a recursive implementation on the NFA without converting to a DFA and this recursive implementation has dramatically worse worst-case space and time complexity at match time than a DFA. Oops.

One might argue that taking something that is already very useful with an O(n) implementation and adding a feature that makes it O(2ⁿ)—even if only in the worse case—is a bad idea. And even if that additional feature itself is useful when you need it, it is still bad to have to risk exponential behavior when you are not using that feature.

What is particularly annoying in the case of XSD regular expressions is that XSD regular expressions are actually pure regular expressions without back-references. In fact, they don’t even have capturing groups. They are purely about testing whether a string belongs in the regular language expressed by the regular expression. Taken on their own right, it makes no sense to implement XSD regular expressions using the exponential recursive algorithm that makes sense with back-references.

XSD regular expressions are not taken on their own right in Xerces, though. Instead, a more general regular expression engine is adapted to implement XSD regular expressions and, therefore, XSD regular expressions suffer from the latent back-referencing ability of the engine.

LazyWeb Request

If you have an open-source Unicode regular expression engine written in Java that compiles the pattern into a thread-safe DFA against which multiple matchers can run concurrently later, please let me know. Even better if the engine already complies with the XSD regular expression syntax. (Implementation note for porting old C implementations: Traditional 256-entry next-transition tables don’t work with Unicode, and transitions probably need to be labeled by something like ICU4J UnicodeSet objects.)

Update: It was pointed out to me that byte-oriented approaches do work with UTF-8. Indeed, I should have mentioned that the matcher could run on UTF-8 string representation but I thought character classes would behave badly in that case. I do not know if there’d be actual badness with character classes and most schema regular expressions are probably want to match Basic Latin anyway. (Except with XSD regular expressions it is easy to match more by accident—for example, the definition of \s is politically correct but not that practical.)

Update: The LazyWeb request above worked. I have been informed that there is an Open Source DFA regular expression implementation for Java.

Update: Björn Höhrmann has created Open Source DFA recognizer operating on UTF-8 for recognizing sets of Unicode characters for Perl.