The Unicode Technical Committee recently decided to change its long-standing guidance on the preferred number of REPLACEMENT CHARACTERs generated for bogus byte sequences when decoding UTF-8. I think this change is inappropriate, because it was based on mere aesthetic considerations and on ICU’s behavior, and because it goes against the behavior of multiple prominent existing implementations that implemented the long-standing previous guidance.
Not all byte sequences are valid UTF-8. When decoding potentially invalid UTF-8 input into a valid Unicode representation, something has to be done about invalid input. One approach is to stop altogether and to signal an error upon finding invalid input. While this is a valid response for some applications, it is not our topic today. The topic at hand is what to do in the non-Draconian case where the decoder continues even after discovering invalid input.
The naïve answer is to ignore invalid input until finding valid input again (i.e. finding the next byte that has a lead-byte value), but this is dangerous and should never be done. The danger is that silently dropping bogus bytes might make a string that didn’t look dangerous with the bogus bytes present become valid active content. Most simply,
<scr�ipt> (� standing in for a bogus byte) could become
<script> if the error is ignored. So it’s non-controversial that every sequence of bogus bytes should result in at least one REPLACEMENT CHARACTER and that the next lead-valued byte is the first byte that’s no longer part of the invalid sequence.
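The hazard is easy to sketch with Python’s standard library, whose decoder exposes both an “ignore” and a “replace” error mode (a stray 0x80 byte stands in for the bogus byte above):

```python
# A lone 0x80 is never valid UTF-8, so it plays the role of the
# bogus byte in "<scr�ipt>".
bogus = b"<scr\x80ipt>"

# Silently dropping the bogus byte resurrects active content:
ignored = bogus.decode("utf-8", errors="ignore")    # "<script>"

# Emitting a REPLACEMENT CHARACTER keeps the string inert:
replaced = bogus.decode("utf-8", errors="replace")  # "<scr\ufffdipt>"

print(ignored)
print(replaced)
```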
But how many REPLACEMENT CHARACTERs should be generated for a sequence of multiple bogus bytes?
Unicode 9.0.0 (page 127) says: “An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only formal requirement mandated by Unicode conformance for a converter is that the <41> be processed and correctly interpreted as <U+0041>. The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence.” So as far as Unicode is concerned, any number from one to the number of bogus bytes (inclusive) is OK. In other words, the precise number is implementation-defined as far as Unicode is concerned.
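One of the permitted choices can be observed directly with Python 3’s built-in decoder. Since F0 requires a second byte in the range 90..BF, the two 80 bytes cannot continue it, and each bogus byte draws its own REPLACEMENT CHARACTER:

```python
# Decoding the <F0 80 80 41> example from the quote with Python 3's
# standard library decoder, substituting U+FFFD for errors.
result = b"\xf0\x80\x80\x41".decode("utf-8", errors="replace")
print(result == "\ufffd\ufffd\ufffdA")  # True
```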
Yet, immediately after saying that there isn’t a conformance requirement for the precise number, the Unicode Standard proceeds to express a preference, motivating it by saying: “To promote interoperability in the implementation of conversion processes, the Unicode Standard recommends a particular best practice.” The “best practice” (until the recent change) was that a maximal invalid sequence of bytes that forms a prefix of a valid sequence is collapsed into one REPLACEMENT CHARACTER and otherwise there is one REPLACEMENT CHARACTER per each bogus byte.
The old preference makes sense when the UTF-8 decoder is viewed as a state machine that recognizes UTF-8 as a regular grammar based on the information presented in table 3-7 “Well-Formed UTF-8 Byte Sequences” in the Unicode Standard (page 125 in version 9.0.0; quoted below) and exhibits the following behavior when encountering a byte that doesn’t fit the grammar at the current state: emit a single REPLACEMENT CHARACTER for whatever has been consumed since the start state, unconsume the byte under examination, and process that byte again from the start state (where a byte that cannot begin any sequence yields a REPLACEMENT CHARACTER of its own).
The conclusion here is that when viewing the UTF-8 decoder as a state machine that encodes knowledge of what byte sequences are valid, the old preference makes perfect sense. In particular, the rule to collapse prefixes of valid sequences is not added complexity but the simple thing arising from not requiring the state machine to unconsume more than the one byte under examination.
| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
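To make the state-machine view concrete, here is a minimal illustrative sketch (plain Python, not any shipping implementation’s actual code) that recognizes the byte ranges of Table 3-7 and, on error, emits one REPLACEMENT CHARACTER and re-examines only the offending byte from the start state:

```python
# Illustrative sketch of a UTF-8 decoder structured as a state machine
# over the byte ranges of Table 3-7, implementing the old guidance.
REPLACEMENT = "\ufffd"

def expected_trails(lead):
    """Continuation-byte ranges expected after a lead byte, per Table 3-7."""
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [(0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]  # excludes surrogates
    if lead == 0xF0:
        return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xF4:
        return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
    return None  # not a valid lead byte

def decode_old_guidance(data):
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        trails = expected_trails(lead)
        if trails is None:  # bogus byte in the start state: one U+FFFD
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        ok = True
        for lo, hi in trails:
            if j >= len(data) or not lo <= data[j] <= hi:
                ok = False
                break
            j += 1
        if ok:
            out.append(data[i:j].decode("utf-8"))
        else:
            # The maximal valid prefix collapses into a single U+FFFD;
            # only the byte under examination is unconsumed and then
            # re-examined from the start state.
            out.append(REPLACEMENT)
        i = j
    return "".join(out)
```

Note how collapsing a prefix of a valid sequence falls out for free: the machine never unconsumes more than the single byte under examination.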
On May 12 2017, the Unicode Technical Committee accepted a proposal (for Unicode 11) to collapse sequences of bogus bytes to a single REPLACEMENT CHARACTER not only when they form a prefix of a valid sequence but also when the bogus bytes fit as a prefix of the general UTF-8 bit pattern. The bit pattern for one, two, three and four-byte sequences is given in table 3-6 “UTF-8 Bit Distribution” in the Unicode Standard (page 125 in version 9.0.0; quoted below). The proposal is ambiguous about whether to do the same thing for five and six-byte sequences whose bit pattern is not defined as existing in Unicode but was defined in now-obsolete RFCs for UTF-8, the last RFC defining them being RFC 2279.
If five and six-byte sequences are treated according to the logic of the newly-accepted proposal, the newly-accepted proposal matches the behavior of ICU. If the decoder is supposed to be unaware of five and six-byte patterns, which are non-existent as far as Unicode is concerned, I am not aware of any implementation matching the new guidance.
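The difference is easy to exhibit with the six-byte pattern from RFC 2279. A decoder that is unaware of five- and six-byte patterns, such as Python 3’s, sees six individually bogus bytes, whereas a decoder applying the new proposal’s logic to those patterns (which, going by the above, appears to be ICU’s behavior) would collapse the whole sequence into a single REPLACEMENT CHARACTER:

```python
# FC 80 80 80 80 80 matched the six-byte bit pattern of the obsolete
# RFC 2279 definition of UTF-8 but means nothing in Unicode's UTF-8.
# Python 3 knows nothing of six-byte patterns, so each byte draws its
# own U+FFFD.
result = b"\xfc\x80\x80\x80\x80\x80".decode("utf-8", errors="replace")
print(result.count("\ufffd"))  # 6
```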
The rationale against the old guidance was “I believe the best practices are wrong” and the rationale in favor of the new guidance was “feels right”. (Really.)
| Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
The new preference makes sense if the UTF-8 decoder is viewed as a bit accumulator that first consumes bytes according to the UTF-8 bit distribution pattern, masking and shifting the variable bits into an accumulator where they form a scalar value, and then, upon completing a sequence according to the bit distribution pattern, checks whether the scalar value is valid given the length of the sequence consumed. A scalar value in the surrogate range or above the Unicode range is always invalid; otherwise, a scalar value is invalid if it could have been represented as a shorter sequence of bytes than the sequence that was actually consumed.
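A minimal sketch of that accumulator formulation (again illustrative Python, not any shipping implementation’s code) shows why the new preference falls out of this model: the bit pattern is consumed first, and validity is checked only afterwards, so an entire overlong or truncated pattern collapses into one REPLACEMENT CHARACTER. Here five- and six-byte patterns are treated as unknown, so their bytes each draw their own REPLACEMENT CHARACTER:

```python
# Illustrative sketch of a UTF-8 decoder structured as a bit
# accumulator over the general bit pattern of Table 3-6, validating
# the scalar value only after a whole pattern has been consumed.
REPLACEMENT = "\ufffd"
MIN_SCALAR = {2: 0x80, 3: 0x800, 4: 0x10000}  # shortest-form thresholds

def decode_accumulator(data):
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            trails, scalar = 1, b & 0x1F  # 110yyyyy
        elif 0xE0 <= b <= 0xEF:
            trails, scalar = 2, b & 0x0F  # 1110zzzz
        elif 0xF0 <= b <= 0xF7:
            trails, scalar = 3, b & 0x07  # 11110uuu
        else:
            out.append(REPLACEMENT)  # continuation byte or 5/6-byte lead
            i += 1
            continue
        j = i + 1
        while j < n and j - i <= trails and 0x80 <= data[j] <= 0xBF:
            scalar = (scalar << 6) | (data[j] & 0x3F)  # shift in 10xxxxxx
            j += 1
        if j - i == trails + 1:
            # Whole pattern consumed; only now check scalar validity.
            if (scalar < MIN_SCALAR[trails + 1]     # non-shortest form
                    or 0xD800 <= scalar <= 0xDFFF  # surrogate
                    or scalar > 0x10FFFF):         # beyond Unicode
                out.append(REPLACEMENT)
            else:
                out.append(chr(scalar))
        else:
            out.append(REPLACEMENT)  # truncated pattern: one U+FFFD
        i = j
    return "".join(out)
```

With this model, `decode_accumulator(b"\xc0\xaf")` yields a single REPLACEMENT CHARACTER, whereas under the old guidance the same input yields two, because <C0 AF> fits the general two-byte bit pattern even though C0 can never begin a well-formed sequence.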
It is worth noting that the concept of accumulating a scalar value during UTF-8 decoding is biased towards using UTF-16 or UTF-32 as the in-memory Unicode representation, since decoding to those forms necessarily involves the use of such an accumulator. When decoding UTF-8 to UTF-8 as the in-memory Unicode representation, while it is possible to first accumulate the scalar value and then re-encode it as UTF-8, doing so is unnecessary and inefficient, and the sort of validation state machine described above makes more sense. Sure, such a state machine could be extended to exhibit the outward behavior of the formulation that involves a scalar accumulator, but it would be extra complexity in service of replicating the behaviors arising from a different model.
So who cares? It’s just a non-normative expression of preference.
There are multiple reasons to care.
If anything, “It’s not a requirement.” should be taken as an argument for why the spec doesn’t need changing, not as an argument for why changes on flimsy grounds are OK. The realization that the Unicode Consortium seems to lack strong objective reasoning to prefer a particular number of REPLACEMENT CHARACTERs could be taken to support the conclusion that maybe it would be best for the Unicode Standard not to express a preference (called a “best practice”, implying the preference is in some sense the “best” option) on this topic at all, but it does not support the conclusion that changing the expressed behavior whichever way is OK.
But most importantly, even if the exact number of REPLACEMENT CHARACTERs were not in itself important enough to care about, and it might seem silly to write this much about it, I see changing a widely-implemented spec on flimsy grounds as poor standard stewardship, and I wish the Unicode Consortium did better than that, so that this kind of failure to investigate what multiple implementations do does not repeat with something more important.
I tested the following implementations (Web browsers by visual inspection and others by performing the conversion and using diff on the output):
MultiByteToWideChar (on Windows 10 1607)
Why these? Browsers should obviously be considered. I already had copy-paste-ready code for ICU, Win32, rust-encoding and the Rust standard library. Java, Ruby, Python 3, Python 2 and Perl 5 were trivial to test due to packaging on Ubuntu and either prior knowledge of how to test or the documentation being approachable. On the other hand,
I timed out trying to find the right API entry point in the Go documentation, and I timed out trying to get GLib to behave (the glibc layer does not handle REPLACEMENT CHARACTER emission). I figured that testing e.g. CoreFoundation (which I believe only wraps ICU but, who knows, could do something else for UTF-8 like WebKit does), Qt or .NET would have taken too much time for me. In any case, the above list should be broad enough to make statements about “multiple prominent implementations”.
I used a specially-crafted HTML document as test input. Since the file is malformed, if you follow the link in your browser, the number of REPLACEMENT CHARACTERs depends on the UTF-8 decoder in your browser.
Most implementations produced bit-identical results matching the old Unicode preference. Therefore, I’m providing only one copy of the output. This file is valid UTF-8, so the number of REPLACEMENT CHARACTERs is encoded in the file and does not depend on your browser. This is the result obtained with (by visual inspection only for browsers):
An interesting browser behavior (link to manual synthesis of valid UTF-8 that looks the way the test input shows up in these browsers) is to emit as many REPLACEMENT CHARACTERs as there are bogus bytes without collapsing sequences that are prefixes of a valid sequence. This is the behavior of:
The rest all had mutually different results (links point to valid output created with these implementations):
As you can see, ICU is the most different from the others. In particular, even though Edge, IE11, Safari, OpenJDK 8 and Perl 5 do not follow the old guidance for everything, they match the old guidance for non-shortest forms.
When there are multiple prominent implementations following the old guidance and only ICU following the new guidance if it is taken to include five and six-byte sequences and no implementation (that I know of) following the new guidance if it is taken to exclude five and six-byte sequences, I think changing the spec shows poor standard stewardship.
It is wasteful if the implementors who followed the previous advice now need to explain why they don’t follow the new “best practice”. The burden of explaining deviations from the “best practice” should instead fall on the developers of the implementations that deviated from the old guidance. It is even more wasteful if the change of the preference expressed by the Unicode Standard results in code changes in any of the implementations that implemented the old preference, since this would result in implementation and quality assurance work (potentially partially in the form of fixing bugs introduced as part of making changes).
A well-managed standard should not induce, for flimsy reasons, such waste on implementors who trusted the standard previously. Changes to widely-implemented long-standing standards should have a very important and strong rationale. The change at hand lacks such a strong rationale.
Now that ICU is a Unicode Consortium project, I think the Unicode Consortium should be particularly sensitive to biases arising from being both the source of the spec and the source of a popular implementation. If the way the Unicode Consortium resolves a discrepancy between ICU behavior and a well-known spec provision (even a mere expression of preference that isn’t a conformance requirement) is by changing the spec instead of changing ICU, it looks really bad on two counts: it undermines the equal footing of ICU vs. other implementations for the purpose of how the standard is developed, and it undermines the reliability of the standard text vs. the ICU source code as the source of truth that other implementors need to pay attention to.
Even though the Unicode Standard expresses the number of REPLACEMENT CHARACTERs as a mere non-normative preference, Web standards these days tend to avoid implementation-defined behavior due to the painful experience of Web sites developing dependencies on the quirks of particular browsers in areas that old specs considered error situations not worth spelling out precise processing requirements for. Therefore, there has been a long push towards well-defined behavior even in error situations on the Web without debating each particular error case individually to assess if the case at hand is prone to sites developing dependencies on a particular behavior. (To be clear, I am not claiming that the number of REPLACEMENT CHARACTERs would be particularly prone to browser-specific site dependences.)
As a result, the WHATWG Encoding Standard, which seeks to be normative over Web browsers, has precise requirements for REPLACEMENT CHARACTER emission. For UTF-8, these used to differ from the preference expressed by the Unicode Standard, but that was reported as a bug in 2012, and the WHATWG Encoding Standard aligned with the preference expressed by the Unicode Standard, making it a requirement. Firefox was changed accordingly at the same time.
Chrome changed in 2016 to bring the Web Template Framework part of Chrome into consistency with V8, since a discrepancy between the two caused a bug! The change cited Unicode directly instead of citing the WHATWG Encoding Standard. This Chrome bug is the strongest evidence that I have seen that the precise behavior can actually matter.
Regrettably, the distance to making all browsers do the same thing was shortest before the Chrome change: back then, the shortest path would have been to make V8 and Firefox emit one REPLACEMENT CHARACTER per bogus byte. However, I am not advocating such a change now. The V8 consistency issue shows that UTF-8 decoding comes into browsers from more places in the code than one would expect, that some of those places are implemented directly downstream of the Unicode Standard instead of downstream of the WHATWG Encoding Standard, and that consistency between them can turn out to matter. In the case of Firefox, there is the Rust standard library in addition to the main encoding library (currently uconv, encoding_rs hopefully in the near future), and I don’t want to ask the Rust standard library to change. (Also, from a purely selfish perspective, replicating the Edge/Safari behavior in encoding_rs, while possible, would add complexity, because it would involve emitting multiple REPLACEMENT CHARACTERs retroactively for bytes that the decoder has already consumed as a valid-looking prefix.)
When two out of the four major browsers match what the WHATWG Encoding Standard says about UTF-8 decoding and the two others are very close, the WHATWG spec is likely to stay as-is. It’s a shame if the Unicode change makes a conformance requirement for the Web differ from the non-requirement preference expressed by the Unicode Standard. There are already enough competing specs and stale spec versions around that making sure browser developers read the right spec requires constant vigilance. It would be sad to have to add this particular part of the Unicode Standard to the “wrong specs to read” list. It would be even worse if the Unicode change leads to more bugs like the discrepancy between V8 and WTF within Chrome.
This is mostly of curiosity value but may be of relevance for the purpose of getting the decision overturned. It appears that the agenda for a Unicode Technical Committee meeting is supposed to be set at least a week in advance of the meeting, but the proposal at issue here seems to have been submitted on shorter notice (proposal dated May 11 and accepted on May 12). Also, the old preference was formulated as the outcome of a more heavy-weight Public Review Issue process, so it seems inappropriate to change the outcome of a heavy-weight process by using a lighter-weight decision process.
First, I hope that the decision to change the preference that the Unicode Standard expresses for the number of REPLACEMENT CHARACTERs is overturned on appeal for the above reasons and for the bad precedent that the change is suggestive of when viewed as a slippery slope towards changing more important things on flimsy grounds. (Or, alternatively, I hope the Unicode Standard stops expressing a preference for the number of REPLACEMENT CHARACTERs altogether beyond “at least one and no more than the number of bogus bytes”.)
Second, I hope that the Unicode Consortium takes steps to mitigate the risk of making decisions on flimsy grounds in the future by requiring proposals to change text concerning implementation behavior (regardless of whether an actual requirement or a mere expression of preference) to come with a survey of the behavior of a large number of prominent existing implementations. The more established a given behavior is in implementations, the stronger the rationale should be to change the required or preferred behavior. The Unicode Consortium-hosted implementation should have no special weight when considering the existing implementation landscape.
That is, I think the proposal to change the preferred behavior in this case should have come with the kind of investigation that I performed here, preferably considering even more implementations, instead of baiting someone other than the person making the proposal to do the investigation after the decision has already been taken.