UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?
I’m writing this down in comprehensive form, because otherwise I will keep writing unsatisfactory partial explanations as bug comments again and again. For more on how to label, see another writeup.
First of all, there is the “Support Existing Content” design principle. Browsers can’t just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can’t realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.
In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn’t realistic to get all legacy pages to opt out.
But there is no single legacy encoding, so if we want to Support Existing Content, we need some way of deciding which one, and we know that given a document that is valid UTF-8, the probability that it was meant to be something other than UTF-8 is virtually zero. So if we decide which one of the legacy encodings we are dealing with not just by the top-level domain name (or the browser UI locale) but by examining the content, why not autodetect UTF-8?
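The claim that valid UTF-8 is almost never meant as anything else can be illustrated with a strict decoder. This is a minimal sketch that assumes the whole document is available as bytes. The helper name looks_like_utf8 is made up for illustration and is not how any browser implements detection.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains at least one non-ASCII byte."""
    if data.isascii():
        # Pure ASCII decodes the same way in UTF-8 and in the legacy
        # encodings, so it is no evidence either way.
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


text = "Größe"
print(looks_like_utf8(text.encode("utf-8")))         # UTF-8 bytes
print(looks_like_utf8(text.encode("windows-1252")))  # legacy bytes
print(looks_like_utf8(b"Size"))                      # ASCII only
```

The windows-1252 bytes fail because a lone byte such as 0xF6 followed by another non-continuation byte is not a valid UTF-8 sequence.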
The issue is not the difficulty of distinguishing UTF-8 from other encodings given the full content. In fact, when loading files from file: URLs, Firefox does detect UTF-8! (Chrome does, too, but less reliably.) For file: URLs, we sacrifice incremental loading on the assumption that most file: URLs point to a local disk (as opposed to a file server mounted as if it was a local drive) which is fast enough that the user would not notice incremental loading anyway. We also assume that file:-URL content is finite.
For https: content, though, incremental processing is important and starting over is bad. Also, some pages intentionally never finish loading and need to be treated as infinite so we never have “full” content!
But we already wait for up to 1024 bytes (in Gecko; in WebKit and in Blink it is more complicated) to scan for meta charset. Why not detect UTF-8 from those same bytes?
This assumes that there is some non-ASCII within the first 1024 bytes. Can we rely on non-ASCII pages to have their first non-ASCII bytes within the first 1024 bytes? No.
The non-markup bytes are typically either in the general-purpose HTML title element or in the content attribute of the Facebook-purpose meta property="og:title" element. Sadly, it is all too possible for these not to be within the first 1024 bytes, because before them, there are things like IE conditional comments, Facebook bogo-namespaces, a heap of rel=preloads, over a dozen icons for iOS, copyright-related comments, or just scripts and stylesheets declared first.
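To make the 1024-byte problem concrete, here is a small sketch that finds the offset of the first non-ASCII byte in a page. The page content, the preload URL, and the helper name first_non_ascii_offset are invented for illustration. Only the 1024-byte window comes from the text above.

```python
import re

PRESCAN_LIMIT = 1024  # the Gecko prescan window mentioned above


def first_non_ascii_offset(data: bytes):
    """Return the offset of the first byte >= 0x80, or None if data is all ASCII."""
    match = re.search(rb"[\x80-\xff]", data)
    return match.start() if match else None


# Hypothetical page: a pile of ASCII-only markup precedes the title.
preloads = b'<link rel=preload href="/static/chunk.js" as=script>\n' * 30
page = (
    b"<!DOCTYPE html>\n<html>\n<head>\n"
    + preloads
    + "<title>Größe</title>".encode("utf-8")
)

offset = first_non_ascii_offset(page)
print(offset, offset >= PRESCAN_LIMIT)
# The prescan window on its own contains no non-ASCII at all.
print(first_non_ascii_offset(page[:PRESCAN_LIMIT]))
```

With thirty preload lines, the title starts well past the prescan window, so a detector limited to that window sees only ASCII.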
What if we scanned until the start of body like WebKit does for meta charset (leaving aside for the moment how confidently we can locate the start of body, which has an optional start tag, before we start the real tokenization and tree building)? Surely title is somewhere in head, and the user cannot perceive incremental rendering until body starts anyway.
So now we see title while we’ve buffered up bytes and haven’t started the real encoding decoder, the tokenizer, or the tree builder. We can now detect from the content of title, right? For non-Latin scripts, yes. Even just the page title in a non-Latin script is very likely enough to decide UTF-8ness. For languages like German or Finnish, no. Even though just about every German or Finnish document has non-ASCII, there’s a very real chance that the few words that end up in the title are fully ASCII. For languages like English, Somali, Dutch, Indonesian, Swahili, or various Malay languages you have even less hope of there being non-ASCII in the title than with German or Finnish even though there might be non-ASCII quotation marks, dashes, or a rare non-ASCII letter (such as a rare letter with diaeresis or acute accent) in a full document. For the World-Wide Web, a solution needs to work for these languages, too, and not just for non-Latin-script languages.
OK, so it seems that something more complicated is needed. Let’s think of fundamental requirements:
If Web authors think they can get away with not declaring UTF-8, many, many Web authors are going to leave UTF-8 undeclared. Therefore, we need a solution that works reliably in 100% of the cases or we’d make the Web Platform more brittle. Timeouts are by definition dependent on something other than the content, so any solution that hand-waves some problem away by adding a timeout would be unreliable in this sense. Likewise, solutions that depend on how HTML content maps to network protocol buffer boundaries are inherently unreliable in this sense.
Also, making UTF-8 work undeclared should not regress performance compared to labeled UTF-8. A performance regression large enough to make the aggregate user experience worse (especially on slow CPUs and Internet connections) but small enough not to be noticed by authors (especially on fast CPUs and fast Internet connections) would be particularly unfortunate.
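This gives us the following basic requirements: the solution must not rely on timeouts, must not depend on how the content maps to network buffer boundaries, and must not regress performance compared to labeled UTF-8.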
Let’s look at what’s wrong with potential solutions. (As noted earlier, simply defaulting to UTF-8 without detection would fail to support existing unlabeled non-UTF-8 content. XHR and Fetch get to do this to avoid other problems, though, because they post-date UTF-8.)
OK, how about we scan until we’ve seen enough non-ASCII to decide, then? This doesn’t work, because for ASCII-only content it would mean buffering all the way to the end of the document, and ASCII-only content is real on the Web. That is, this would break incremental rendering e.g. for English. Trying to hand-wave the problem away using a timeout would fail the requirement not to have timeouts. Trying to have a limit based on byte count would make the solution unreliable e.g. for English content that has a copyright sign in the page footer or for Dutch content that has a letter with the diaeresis further from the page start than whatever the limit is.
Chrome already has detection for legacy encodings. How about detecting UTF-8 byte patterns, too?
This would not be at all reliable. Chrome’s detection is opportunistic from whatever bytes the HTML parser happens to have available up front. This means that the result not only depends on timing and network buffer boundaries but also fails to account for non-ASCII after a long ASCII prefix.
How about doing what Chrome does, but deciding UTF-8 if all the bytes available at the time of decision are ASCII?
This would break some existing unlabeled non-UTF-8 content with a long ASCII prefix. Additionally, the breakage would be dependent on timing and network buffer boundaries.
Firefox already has detection for legacy encodings. How about detecting UTF-8 byte patterns, too?
Firefox has a solution that does not depend on timing or network buffer boundaries and that can deal with long ASCII prefixes. If the meta prescan of the first 1024 bytes fails, Firefox runs the encoding detector on those 1024 bytes taking into account the top-level domain as an additional signal. If those bytes are all ASCII (and don’t contain an ISO-2022-JP escape sequence), Firefox at that point decides from the top-level domain. Upon encountering the end of the stream, Firefox guesses again now taking into account all the bytes. If the second guess differs from the first guess, the page is reloaded using the result of the second guess.
(The above description does not apply to the .jp, .in, and .lk TLDs. .jp has a special-purpose detector that detects among Japanese encodings only and triggers the reload, if needed, as soon as the decision is possible. .in and .lk fall back to windows-1252 without detection to accommodate old font hacks.)
When there’s a 1024-byte (or longer) ASCII prefix, reloading the page would regress performance relative to labeling UTF-8. Also, there is the additional problem that side effects of scripts (e.g. outbound XHR/Fetch) could be observed twice.
How about guessing UTF-8 instead of making a TLD-based guess when the first 1024 bytes are ASCII?
This solution would be better, but it would regress performance in the form of reloads for existing pages that currently don’t suffer such problems in order to allow UTF-8 to go undeclared for new pages. Furthermore, pages that load different-origin pages into iframes could be confused by those pages reloading on their own. Sure, this problem is already present in Firefox, but it occurs rarely thanks to the TLD-based guess being pretty good except for non-windows-1252 content on generic domains. This solution would make it occur for every unlabeled non-UTF-8 page with a 1024-byte ASCII prefix. Moreover, this would break legacy-encoded documents that never reach the end of the stream, such as pre-Web Socket chat response iframes.
Even for new unlabeled UTF-8 pages, there would be a performance penalty relative to labeled UTF-8: the cost of processing all the bytes of the page using the detector.
Could we do something about the performance penalty for unlabeled UTF-8 content?
Yes, we could. First, the ASCII prefix is already skipped over using SIMD and without pushing to each detector state machine. We could define how many characters of a given UTF-8 sequence length need to be seen in order to stay with UTF-8 and stop running the detector. In the case of two-byte UTF-8 sequences, seeing only one is not enough. In the case of three-byte UTF-8 sequences, maybe even one is enough. This would mitigate the concern of unlabeled UTF-8 suffering a performance penalty relative to labeled UTF-8.
However, this would still leave the issue of reloading non-UTF-8 pages that presently don’t need to be reloaded thanks to the TLD-based guess and the issue of breaking legacy-encoded pages that intentionally never reach the end of the stream.
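The idea of stopping the detector after enough UTF-8 sequences of a given length could look roughly like the sketch below. The threshold numbers are placeholders: the text only says that one two-byte sequence is not enough and that one three-byte sequence may be enough. The function name utf8_confident_at is hypothetical, and the sketch assumes the input is valid UTF-8 as a whole, whereas a real detector would validate incrementally.

```python
# Placeholder thresholds: how many sequences of each byte length must be
# seen before committing to UTF-8. The specific numbers are assumptions.
REQUIRED = {2: 2, 3: 1, 4: 1}


def utf8_confident_at(data: bytes):
    """Return the byte offset after which detection could stop, or None.

    Assumes data is valid UTF-8; decoding raises UnicodeDecodeError otherwise.
    """
    seen = {2: 0, 3: 0, 4: 0}
    offset = 0
    for ch in data.decode("utf-8"):
        length = len(ch.encode("utf-8"))
        offset += length
        if length > 1:
            seen[length] += 1
            if seen[length] >= REQUIRED[length]:
                return offset
    return None


print(utf8_confident_at("a®b".encode("utf-8")))   # one two-byte sequence only
print(utf8_confident_at("a®b©".encode("utf-8")))  # second two-byte sequence
print(utf8_confident_at("a€".encode("utf-8")))    # one three-byte sequence
```

Past the returned offset, the rest of the page could be handled by a plain UTF-8 decoder without feeding the detector.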
What’s the deal with the reloading anyway? An ASCII prefix decodes the same in both UTF-8 and in legacy encodings (other than UTF-16BE and UTF-16LE, which are handled on the BOM sniffing layer), so why not just pass the ASCII prefix through and make the detection decision afterwards?
That is, instead of treating decoding as a step that happens after detection, how about fusing the detector into a decoder such that the decoder streams ASCII through (to the HTML tokenizer) until seeing an ISO-2022-JP escape or a non-ASCII byte, and in the former case turns into a streaming ISO-2022-JP decoder immediately and in the latter case buffers bytes until the fused detector has confidently made its guess, turns into a decoder for the guessed encoding, outputs the buffer decoded accordingly, and thereafter behaves as a streaming decoder for the guessed encoding?
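The passthrough boundary described above can be sketched as a function that splits an incoming chunk at the first byte that is either an ESC (which starts ISO-2022-JP escape sequences) or non-ASCII. This shows only where streaming would stop and buffering would begin. It does not detect or decode anything, and the name split_passthrough is made up for illustration.

```python
ESC = 0x1B  # first byte of ISO-2022-JP escape sequences


def split_passthrough(chunk: bytes):
    """Split chunk into (ascii_prefix, rest) at the first ESC or non-ASCII byte.

    ascii_prefix could be handed straight to the tokenizer; rest would be
    held back until an encoding decision has been made.
    """
    for i, byte in enumerate(chunk):
        if byte == ESC or byte >= 0x80:
            return chunk[:i], chunk[i:]
    return chunk, b""


prefix, rest = split_passthrough("<title>Café</title>".encode("utf-8"))
print(prefix)  # the part that is safe to stream
print(rest)    # starts at the first non-ASCII byte
```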
Just as detecting UTF-8 is simple given access to the whole document but complicated because document loading on the Web happens over time, the ASCII prefix is more complicated than it seems.
If the ASCII prefix is passed through to the HTML tokenizer, parsed, and the corresponding part of the DOM built before the encoding is decided, two issues need to be addressed. First, scripts in the prefix can read document.characterSet before the encoding has been decided. Second, the prefix may already contain an external script, <link rel=stylesheet>, or same-origin <iframe>, and the encoding of the document inherits into those in case they turn out to lack encoding declarations of their own.
Does the second issue matter? Maybe it does, in which case passing through the ASCII prefix before deciding the encoding won’t work. However, more likely it doesn’t.
If it doesn’t, we can make up a special name signifying ongoing detection and expose it from document.characterSet and inherit it into external scripts, stylesheets, and same-origin iframes. This means that detection expands from being an HTML loading-specific issue to being something that the script and style loaders need to deal with as well (i.e. they need to also run the detector if the special name is inherited).
If we were to go this route, we should use pre-existing IE special names. The generic detector should be called _autodetect_all and the .jp TLD-specific detector should be called _autodetect. (IE got Japanese detection in IE4. The generic detector was not in IE4 but was added by IE6 at the latest. Hence the Japanese case getting the shorter name.)
In addition to exposing non-encoding-name values via document.characterSet and making detection spill over to the script and style loaders, this poses a problem similar to the earlier ASCII prefix problems: What if there’s a two-byte UTF-8 sequence, which on its own could be plausible as two German windows-1252 characters or as a single legacy CJK character, and then another long stretch of ASCII? For example, UTF-8 ®, which is reasonable in an English page title, maps to 庐 in GBK, 簧 in Big5, 速 in EUC-JP, and 짰 in EUC-KR. The characters land in the most common section (Level 1 Hanzi/Kanji or common Hangul) in each of the four encodings.
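The ambiguity of a lone two-byte sequence is easy to reproduce. The sketch below decodes the UTF-8 bytes of ® with Python’s built-in codecs, which are assumed here to agree with the browser decoders for these particular bytes.

```python
utf8_bytes = "®".encode("utf-8")  # the two bytes 0xC2 0xAE

# Each legacy CJK encoding reads the same two bytes as one common character.
for legacy in ("gbk", "big5", "euc_jp", "euc_kr"):
    print(legacy, utf8_bytes.decode(legacy))

# Read as windows-1252, the same bytes are two Latin-1-range characters.
print("windows-1252", utf8_bytes.decode("windows-1252"))
```

None of these decodes raises an error, which is why the bytes alone cannot settle the question.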
So if UTF-8 ® stops ASCII passthrough and starts buffering, because the character alone isn’t a conclusive sign of UTF-8ness, it is easy to break incremental rendering, since on an English page buffering until more non-ASCII characters are found could end up reaching the end of the stream.
The problem could be alleviated in a way that doesn’t depend on timing or on buffer boundaries. If the page indeed is German in windows-1252 or Chinese, Japanese, or Korean in a legacy encoding, there should be more non-ASCII byte sequences at a shorter distance from the previous one than in UTF-8 English (or Dutch, etc.). The German non-ASCII sequences will be relatively far apart, but it’s very improbable that the next occurrence of windows-1252 non-ASCII will also constitute a valid UTF-8 byte sequence. GBK, Big5, EUC-JP, and EUC-KR can easily have multiple consecutive two-byte sequences that are also plausible UTF-8 byte sequences. However, once non-ASCII starts showing up, more non-ASCII is relatively close and at some point, there will be a byte sequence that’s not valid UTF-8.
It should be possible to pick a number such that if the detector has seen non-ASCII but hasn’t yet decided UTF-8 vs. non-UTF-8, if it subsequently sees more ASCII bytes in a row than the chosen number, it decides UTF-8.
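The run-length rule could be sketched as follows. The limit of 64 is an arbitrary placeholder, since the text only says that such a number should exist, and the names decide and sequence_length are hypothetical. A real detector would also weigh the legacy candidates against each other. This sketch shows only the UTF-8 versus non-UTF-8 decision.

```python
ASCII_RUN_LIMIT = 64  # placeholder; the right value would need to be researched


def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte, or 0 if invalid."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decide(data: bytes, limit: int = ASCII_RUN_LIMIT):
    """Return "utf-8", "legacy", or None (undecided) for the bytes seen so far."""
    seen_non_ascii = False
    ascii_run = 0
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            i += 1
            if seen_non_ascii:
                ascii_run += 1
                if ascii_run > limit:
                    return "utf-8"  # long ASCII run after valid UTF-8
            continue
        n = sequence_length(byte)
        if n == 0:
            return "legacy"
        seq = data[i:i + n]
        if len(seq) < n:
            return None  # sequence cut off at the end; wait for more bytes
        try:
            seq.decode("utf-8")
        except UnicodeDecodeError:
            return "legacy"
        seen_non_ascii = True
        ascii_run = 0  # every valid non-ASCII sequence restarts the count
        i += n
    return None


print(decide("®".encode("utf-8") + b"a" * 65))
print(decide("®".encode("utf-8") + b"a" * 10))
print(decide("Männer und Größe".encode("windows-1252")))
```

Because the decision depends only on byte counts in the content, it gives the same result regardless of timing or of how the bytes were split into network buffers.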
Apart from making the bet that exposing weird values from document.characterSet wouldn’t break the Web, the solution sketched above would involve behaviors that none of Gecko, Blink, or WebKit currently have. Just letting Web authors omit labeling UTF-8 does not seem like a good enough reason to introduce such complexity.
text/plain can’t use
The case against doing this is less strong than in the HTML case. However, it’s a slippery slope. It would be bad for Firefox to do this unilaterally and to provoke Chrome to do more detection if it meant Chrome picking one of the easy-for-Chrome brittle options from the start of the above list instead of doing something robust and cross-vendor-agreed-upon.