chardetng is a new small-binary-footprint character encoding detector for Firefox written in Rust. Its purpose is user retention by addressing an ancient—and for non-Latin scripts page-destroying—Web compat problem that Chrome already addressed. It puts an end to the situation that some pages showed up as unreadable garbage in Firefox but opening the page in Chrome remedied the situation more easily than fiddling with Firefox menus. The deployment of chardetng in Firefox 73 marks the first time in the history of Firefox (and in the entire Netscape lineage of browsers since the time support for non-ISO-8859-1 text was introduced) that the mapping of HTML bytes to the DOM is not a function of the browser user interface language. This is accomplished at a notably lower binary size increase than what would have resulted from adopting Chrome’s detector. Also, the result is more explainable and modifiable as well as more suitable for standardization, should there be interest, than Chrome’s detector
chardetng targets the long tail of unlabeled legacy pages in legacy encodings. Web developers should continue to use UTF-8 for newly-authored pages and to label HTML accordingly. This is not a new feature for Web developers to make use of!
Although chardetng first landed in Firefox 73, this write-up discusses chardetng as of Firefox 78.
There is a long tail of legacy Web pages that fail to label their encoding. Historically, browsers have used a default associated with the user interface language of the browser for unlabeled pages and provided a menu for the user to choose something different.
In order to get rid of the character encoding menu, Chrome adopted an encoding detector (apparently) developed for Google Search and Gmail. This was done without discussing the change ahead of time in standard-setting organizations. Firefox had gone in the other direction of avoiding content-based guessing, so this created a Web compat problem that, when it occurred for non-Latin-script pages, was as bad as a Web compat problem can be: Encoding-unlabeled legacy pages could appear completely unreadable in Firefox (unless a well-hidden menu action was taken) but appear OK in Chrome. Recent feedback from Japan as well as world-wide telemetry suggested that this problem still actually existed in practice. While Safari detects less, if a user encounters the problem on iOS, there is no other engine to go to, so Safari can’t be used to determine the abandonment risk for Firefox on platforms where a Chromium-based browser is a couple of clicks or taps away. Edge’s switch to Chromium signaled an end to any hope of taking the Web Platform in the direction of detecting less.
ICU4C’s detector wasn’t accurate (or complete) enough. Firefox’s old and mostly already removed detector wasn’t complete enough and completing it would have involved the same level of work as writing a completely new one. Since Chrome’s detector, ced, wasn’t developed for browser use cases, it has a larger footprint than is necessary. It is also (in practice) unmodifiable over-the-wall Open Source, so adopting it would have meant adopting a bunch of C++ that would have had known-unnecessary bloat while also being difficult to clean up.
Developing an encoding detector is easier and takes less effort than it looks once one has made the observations that one makes when developing a character encoding conversion library. chardetng’s foundational binary size-reducing idea is to make use of the legacy Chinese, Japanese, and Korean (CJK) decoders (that a browser has to have anyway) for the purpose of detecting CJK legacy encodings. chardetng is also strictly scoped to Web-relevant use cases.
On x86_64, the binary size contribution of chardetng (and the earlier companion Japanese-specific detector) to libxul is 28% of what ced would have contributed to libxul. If we had adopted ced and later wanted to make a comparable binary size saving, hunting for savings around the code base would have been more work than writing chardetng from scratch.
The focus on binary size makes chardetng take 42% longer to process the same data compared to ced. However, this tradeoff makes sense for code that runs for legacy pages and doesn’t run at all for modern pages. The accuracy of chardetng is competitive with ced, and chardetng is integrated into Firefox in a way that gives it a better opportunity to give the right answer compared to the way ced is integrated into Chrome.
chardetng has been developed in such a way that it would be standardizable, should there be interest to standardize it at the WHATWG. The data tables are under CC0, and non-CC0 part of chardetng consists of fewer than 3000 lines of explainable Rust code that could be reversed into spec English.
Before getting to how chardetng works, let’s look in more detail at why it exists at all.
Back when the Web was created, operating systems had locale-specific character encodings. For example, a system localized for Greek had a different character encoding from a system localized for Japanese. The Web was created in Switzerland, so bytes were assumed to be interpreted according to ISO-8859-1, which was the Western European encoding for Unix-ish systems and also compatible with the Western European encoding for Windows. (The single-byte encodings on Mac OS Classic were so different from Unix and Windows that browsers on Mac had to permute the bytes and couldn’t assume content from the Web to match the system encoding.)
As browsers were adapted to more languages, the default treatment of bytes followed the system-level design of the textual interpretation of bytes being locale-specific. Even though the concept of actually indicating what character encoding content was in followed relatively quickly, the damage was already done. There were already unlabeled pages out there, so—for compatibility—browsers retained locale-specific fallback defaults. This made it possible for authors to create more unlabeled content that appeared to work fine when viewed with an in-locale browser but that broke if viewed with an out-of-locale browser. That’s not so great for a system that has “world-wide” in its name.
For a long time, browsers provided a menu that allowed the user to override the character encoding when the fallback character encoding of the page author’s browser and the fallback character encoding of the page reader’s browser didn’t match. Browsers have also traditionally provided a detector for the Japanese locale for deciding between Shift_JIS and EUC-JP (also possibly ISO-2022-JP) since there was an encoding split between Windows and Unix. (Central Europe also had such an encoding split between Windows and Unix, but, without a detector, Web developers needed to get their act together instead and declare the encodings…) Firefox also provided an on-by-default detector for the Russian and Ukrainian locales (but not other Cyrillic locales!) for detecting among multiple Cyrillic encodings. Later on, Firefox also tried to alleviate the problem by deciding from the top-level domain instead of the UI localization when the top-level domain was locale-affiliated. However, this didn’t solve the problem for .com/.net/.org, for local files, or for locales with multiple legacy encodings even if not on the level of prevalence as in Japan.
As part of an effort to get rid of the UI for manually overriding the character encoding of an HTML page, Chrome adopted compact_enc_det (ced), which appears to be Google’s character encoding detector from Google Search and Gmail. This was done as a surprise to us without discussing the change in standard-setting organizations ahead of time. This led to a situation where Firefox’s top-level domain or UI locale heurististics could fail but Chrome’s content-based detection could succeed. It is likely more discoverable and easier to launch a Chromium-based browser instead of seeking to remedy the situation within Firefox by navigating rather well-hidden submenus.
When the problem occurred with non-Latin-script pages, it was about as bad as a Web compat problem can be: The text was completely unreadable in Firefox and worked just fine in Chrome.
Indeed, this issue has not been a practical problem for quite a while for users who are in windows-1252 locales and read content from windows-1252 locales, where a “windows-1252 locale” is defined as a locale where the legacy “ANSI” code page for the Windows localization for that locale is windows-1252. There are two reasons why this is the case. First, the Web was created in this locale cohort (windows-1252 is a superset of ISO-8859-1), so the defaults have always been favorable. Second, when the problem does occur, the effect is relatively tolerable for the languages of these locales. These locales use the Latin script with relatively infrequent non-ASCII characters. Even if those non-ASCII characters are replaced with garbage, it’s still easy to figure out what the text as a whole is saying.
With non-Latin scripts, the problem is much more severe, because pretty much all characters (except ASCII spaces and ASCII punctuation) get replaced with garbage.
Clearly, when the user invokes the Text Encoding menu in Firefox, this user action is indicative of the user having encountered this problem. All over the world, the use of the menu could be fairly characterized as very rare. However, how rare varied by a rather large factor. If we take the level of usage in Germany and Spain, which are large non-English windows-1252 locales, as the baseline of a rate of menu usage that we could dismiss as not worth addressing, the rate of usage in Macao was 85 times that (measured not in terms of accesses to the menu directly but in terms Firefox subsessions in which the menu had been used at least once; i.e. a number of consecutive uses in one subsession counted just once). Ignorable times 85 does not necessarily equal requiring action, but this indicates that an assessment of the severity of the problem from a windows-1252 experience is unlikely to be representative.
Also, the general rarity of the menu usage is not the whole story. It measures only the cases where the user knew how and bothered to address the problem within Firefox. It doesn’t measure abandonment: It doesn’t measure the user addressing the problem by opening a Chromium-based browser instead. For every user who found the menu option and used it in Firefox, several others likely abandoned Firefox.
The feedback kept coming from Japan that the problem still occurred from time to time. Telemetry indicated that it occurred in locales where the primary writing system is Traditional Chinese at even a higher rate than in Japan. In mainland China, which uses Simplified Chinese, the rate was a bit lower than in Japan but on a similar level. My hypothesis is that despite Traditional Chinese having a single legacy encoding in the Web Platform and Simplified Chinese also having a single legacy encoding in the Web Platform (for decoding purposes), which would predict these locales as having success from a locale-based guess, users read content across the Traditional/Simplified split on generic domains (where a guess from the top-level domain didn’t apply) but don’t treat out-of-locale failures as reportable. In the case of Japan and the Japanese language, the failures are in-locale and apparently treated as more reportable.
In general, as one would expect, the menu usage is higher in locales where the primary writing system is not the Latin script than in locales where the primary writing system is the Latin script. Notably, Bulgaria was pretty high on the list. We enabled our previous Cyrillic detector (which probably wasn’t trained with Bulgarian) by default for the Russian and Ukrainian localizations but not for the Bulgarian localization. Also, we lacked a TLD mapping for .cy and, sure enough, the rate of menu usage was higher in Cyprus than in Greece.
Again, as one would expect, the menu usage was higher in non-windows-1252 Latin-script locales than in windows-1252 Latin-script locales. Notably, the usage in Azerbaijan was unusually high compared to other Latin-script locales. I had suspected we had the wrong TLD mapping for .az, because at the time the TLD mappings were introduced to Firefox, we didn’t have an Azerbaijani localization to learn from.
In fact, I suspected that our TLD mapping for .az was wrong at the very moment of making the TLD mappings, but I didn’t have data to prove it, so I erred on the side of inaction. Now, years later, we have the data. It seems to me that it was better to trust the feedback from Japan and fix this problem than to gather more data to prove that the Web compat problem deserved fixing.
So why not fix this earlier or at the latest when Chrome introduced their present detector? Generally, it seems a better idea to avoid unobvious behaviors in the Web Platform if they are avoidable, and at the time Microsoft was going in the opposite direction compared to Chrome with neither a menu nor a detector in EdgeHTML-based Edge. This made it seem that there might be a chance that the Web Platform could end up being simpler, not more complex, on this point.
While it would be silly to suggest that this particular issue factored into Microsoft’s decision to base the new Edge on Chromium, the switch removed EdgeHTML as a data point suggesting that things could be less complex than they already were in Gecko. At this point, it didn’t make sense to pretend that if we didn’t start detecting more, Chrome and the new Edge would start detecting less.
Safari has less detection than Firefox had previously. As far as I can tell, Safari doesn’t have TLD-based guessing, and the fallback comes directly from the UI language with the detail that if the UI language is Japanese, there’s content-based guessing between Japanese legacy encodings. However, to the extent Safari/WebKit is engineered to the iOS market conditions, we can’t take this as an indication that this level of guessing would be enough for Gecko relative to Chromium. If users encounter this problem on iOS, there’s no other engine they can go to, and the problem is rare enough that users aren’t going to give up on iOS as a whole due to this problem. However, on every platform that Gecko runs on, a Chromium-based browser a couple of clicks or taps away.
After deciding to fix this, there’s the question of how to fix it. Why write new code instead of reusing existing code?
ICU4C has a detector, but Chrome had already rejected it as not accurate enough. Indeed, in my own testing with title-length input, ICU4C was considerably less accurate than the alternatives discussed below.
At this point you may be thinking “Wait, didn’t Firefox have a ‘Universal’ detector and weren’t you the one who removed it?” Yes and yes.
Firefox used to have two detectors: the “Universal” detector, also known as chardet, and a separate Cyrillic detector. chardet had the following possible configurations: Japanese, Traditional Chinese, Simplified Chinese, Chinese, Korean, and Universal, where the last mode enabled all the detection capabilities and the other modes enabled subsets only.
This list alone suggests a problem: If you look at the Web Platform as it exists today, Traditional Chinese has a single legacy encoding, Simplified Chinese has a single legacy decode mode (both GBK and GB18030 decode the same way and differ on the encoding side), and Korean has a single legacy encoding. The detector was written at a time when Gecko was treating character encodings as Pokémon and tried to catch them all. Since then we’ve learned that EUC-TW (an encoding for Traditional Chinese), ISO-2022-CN (an encoding for both Traditional Chinese and Simplified Chinese), HZ (an encoding for Simplified Chinese), ISO-2022-KR (an encoding for Korean), and Johab (an encoding for Korean) never took off on the Web, and we were able to remove them from the platform. With a single real legacy encoding per each of the Traditional Chinese, Simplified Chinese, and Korean, the only other detectable was UTF-8, but detecting it is problematic (more on this later).
When I filed the removal bug for the non-Japanese parts of chardet in early 2013, chardet was enabled in its Japanese-only mode by default for the Japanese localization, the separate Cyrillic detector was enabled for the Russian and Ukrainian localizations, and chardet was enabled in the Universal mode for the Traditional Chinese localization. At the time, the character encoding defaults for the Traditional Chinese localization differed from how they were supposed to be set in another way, too: It had UTF-8 instead of Big5 as the fallback encoding, which is why I didn’t take the detector setting very seriously, either. In the light of later telemetry, the combined Chinese mode, which detected between both Traditional and Simplified Chinese, would have been useful to turn on by default both for the Traditional Chinese and Simplified Chinese localizations.
From time to time, there were requests to turn the “Universal” detector on by default for everyone. That would not have worked, because the “Universal” detector wasn’t actually universal! It was incomplete and had known bugs, but people who hadn’t examined it more closely took the name at face value. That is, over a decade after the detector paper was presented (on September 12th 2001), the detector was still incomplete, off by default except in the Japanese subset mode and in a case that looked like an obvious misconfiguration, and the name was confusing.
The situation was bad as it was. There were really two options: removal or actually investing in fixing it. At the time, Chrome hadn’t gone all-in with detection. Given what other browsers were doing and not wishing to make the Web Platform more complex than it appeared to need to be, I pushed for removal. I still think that the call made sense considering the circumstances and information available.
As for whether it would have made sense to resurrect the old detector in 2019, digging up the code from version control history would still have resulted in an incomplete detector that would not have been suitable for turning on by default generally. Meanwhile, a non-Mozilla fork had added some things and another effort had resulted in a Rust port. Apart from having to sort out licensing (the code forked before the MPL 2.0 upgrade, the Rust port retained only the LGPL3 part of the license options, and Mozilla takes compliance with the LGPL relinking clause seriously making the use of LGPLed code in Gecko less practical than the use of other Open Source code) even the forks would have required more work to complete despite at least the C++ fork being more complete than the original Mozilla code.
Moreover, despite validating legacy CJK encodings being one of the foundational ideas of the old detector (called “Coding Scheme Method” in the paper), the old detector didn’t make use of the browser’s decoders for these encodings but implemented the validation on its own. This is great if you want a C++ library that has no dependencies, but it’s not great if you are considering the binary size cost in a browser that already has validating decoders for these encodings and ships on Android where binary size still matters.
A detector has two big areas: How to handle single-byte encodings and how to handle legacy CJK. For the former, if you need to develop tooling to add support for some more single-byte encodings or language coverage for single-byte encoding already detected for some languages, you will have built tooling that allows you to redo all of them. Therefore, if the prospect is adding support for more single-byte cases and to reuse for CJK detection the decoders that the browser has anyway, doing the work within the frame of the old decoder becomes a hindrance and it makes sense to frame it as newly-written code only instead of trying to formulate the changes as patches to what already existed.
Using Chrome’s detector (ced) looks attractive on the surface. If Firefox used the exact same detector as Chrome, Firefox could never do worse than Chrome, since both would do the same thing. There are problems with this, though.
As a matter of a health-of-the-Web principle, it would be bad if an area of the Web platform became defined as having to run a particular implementation. Also, the way ced integrates with Chrome violates the design principles of Firefox’s HTML parser. What gets fed to ced in Chrome depends on buffer boundaries as delivered by the networking subsystem to the HTML parser. Prior to Firefox 4, HTML parsing in Firefox depended on buffer boundaries in an even worse way. Since Firefox 4, I’ve considered it a goal not to make Firefox’s mapping of a byte stream into a DOM dependent on the network buffer boundaries (or the wall clock). Therefore, if I had integrated ced into Firefox, the manner of integration would have been an opportunity for different results, but that would have been negligible. That is, these principled issues don’t justify putting effort into a new implementation.
The decisive reasons are these: ced is over-the-wall Open Source to the point that even Chrome developers don’t try to change it beyond post-processing its output and making it compile with newer compilers. License-wise the code is Open Source / Free Software, but it’s impractical to exercise the freedom to make modifications to its substance, which makes it close to an invariant section in practice (again, not as a matter of license). The code doesn’t come with the tools needed to regenerate its generated parts. And even if it did, the input to those tools would probably involve Google-specific data sources. There isn’t any design documentation (that I could find) beyond code comments. Adopting ced would have meant adopting a bunch of C++ code that we wouldn’t be able to meaningfully change.
But why would one even want to make changes if the goal isn’t to ever improve detection and the goal is just to ensure our detection is never worse than Chrome’s? First of all, as in the case of chardet, ced is self-contained and doesn’t make use of the algorithms and data that a browser already has to have in order to decode legacy CJK encodings once detected. But it’s worse than that. “Post-processing” in the previous paragraph means that ced has more potential outcomes than a browser engine has use for, since ced evidently was not developed for the browser use case. (As noted earlier, ced has the appearance of having been developed for Google Search and Gmail.)
For example, ced appears to make distinctions between various flavors of Shift_JIS in terms of their carrier-legacy emoji mappings (which may have made sense for Gmail at some point in the past). It’s unclear how much table duplication results form this, but it doesn’t appear to be full duplication for Shift_JIS. Still, there are other cases that clearly do lead to table duplication even though the encodings have been unified in the Web Platform. For example, what the Web Platform unifies as a single legacy Traditional Chinese encoding occurs as three distinct ones in ced and what the Web Platform unifies as a single legacy Simplified Chinese encoding for decoding purposes appears as three distinct generations in ced. Also, ced keeps around data and code for several encodings that have been removed from the Web Platform (to a large extent because Chrome demonstrated the feasibility of not supporting them!), for KOI8-CS (why?), and for a number of IE4/Netscape 4-era / pre-iOS/Android-era deliberately misencoded fonts deployed in India. (We’ll come back to those later.)
My thinking around internationalization in relation to binary size is still influenced by the time period when, after shipping on desktop, Mozilla didn’t ship the ECMAScript i18n API on Android for a long time due to binary size concerns. If we had adopted ced and later decided that we wanted to run an effort (BinShrink?) to make the binary size smaller for Android, it would probably have taken more time and effort to save as many bytes from somewhere else (as noted, changing ced itself is something even the Chrome developers don’t do) as chardetng saves relative to just adopting ced than it took me to write chardetng. Or maybe, if done carefully, it would have been possible to remove the unnecessary parts of ced without access to the original tools and to ensure that the result still kept working, but writing the auxiliary code to validate the results of such an effort would have been on the same order of magnitude of effort as writing the tooling for training and testing chardetng.
That most Open Source encoding detectors are ports of the old Mozilla code, that ICU’s effort to write their own resulted in something less accurate, and the assumption that Google’s detector draws from some massive Web-scale analysis make it seem like writing a detector from scratch is a huge deal. It isn’t that big a deal, really.
After one has worked on implementing character encodings for a while, some patterns emerge. These aren’t anything new. The sort of things that one notices are the kinds of things the chardet paper attributes to Frank Tang noticing while at Netscape. Also, I already knew about language-specific issues related to e.g. Albanian, Estonian, Romanian, Mongolian, Azerbaijani, and Vietnamese that are discussed below. That is, many of the issues that may look like discoveries in the development process are things that I already knew before writing any code for chardetng.
Furthermore, these days, you don’t need to be Google to have access to a corpus of language-labeled human-authored text: Wikipedia publishes database dumps. Synthesizing legacy-encoded data from these has the benefit that there’s no need to go locate actual legacy-encoded data on the Web and check that it’s correctly labeled.
An encoding detector is a component with identifiable boundaries and a very narrow API. As such, it’s perfectly suited to be exposed via Foreign Function Interface. Furthermore, the plan was to leverage the existing CJK decoders from encoding_rs—the encoding conversion library used in Firefox—and that’s already Rust code. In this situation, it would have been wrong not to take the productivity and safety benefits of Rust. As a bonus, Rust offers the possibility of parallelizing the code using Rayon, which may or may not help, which isn’t known before measuring, but the cost of trying is very low whereas the cost of planning for parallelism in C++ is very high. (We’ll see later that the measurement was a disappointment.)
With the “why” out of the way, let’s look at how chardetng actually works.
One principled concern related to just using ced was that it’s bad for interoperability of the Web Platform depending on a single implementation that everyone would have to ship. Since chardetng is something that I just made up without a spec, is it any better in this regard?
Before writing a single line of code I arranged the data tables used by chardetng to be under CC0 so as enable their inclusion in a WHATWG spec for encoding detection, should there be interest for one. Also, I have tried to keep everything that chardetng does explainable. The non-CC0 (Apache-2.0 OR MIT) part of chardetng is under 3000 lines of Rust that could realistically be reversed into spec English in case someone wanted to write an interoperable second implementation.
The training tool for creating the data tables given Wikipedia dumps would be possible to explain as well. It runs in two phases. The first one computes statistics from Wikipedia dumps, and the second one generates Rust files from the statistics. The intermediate statistics are available, so as long as you don’t change the character classes or the set of languages, which would invalidate the format of the statistics, you can make changes to the later parts of the code generation and re-run it from the same statistics that I used.
The most foundational idea of chardetng is the observation that legacy CJK encodings have enough structure to them that if you take bytes and a decoder for a legacy CJK encoding and try to decode the bytes with the decoder, if the bytes aren’t intended to be in that encoding, the decoder will report an error sooner or later, with the exception that EUC-family encodings may decode without error as other EUC-family encodings. And that a browser (or any app for that matter) that wants to make useful use of the detection result has to have those decoders anyway.
The EUC family (EUC-JP, EUC-KR, and GBK—the GB naming is more commonly used than the EUC-CN name) can be distinguished by observing that Japanese text has kana, which distinguishes EUC-JP, and Hanja is very rare in Korean, so if enough EUC byte pairs fall outside the KS X 1001 Hangul range, chances are the text is Chinese in GBK.
Some single-byte encodings have unassigned bytes. If an unassigned byte occurs, the encoding can be removed from consideration. C1 controls can be treated as if unassigned.
Bicameral scripts have a certain regularity to capital letter use. Even though e.g. brand names can violate these regularities, the bulk of text should consist of words that are lower-case, start with an upper-case letter, or are in all-caps.
Non-Latin letters generally don’t occur right before or right after Latin letters. Hence, pairing an ASCII letter with a non-ASCII letter is indicative of the Latin script and should be penalized in non-Latin-script encodings.
Single-byte encodings can be distinguished well enough from the relative probabilities of byte pairs excluding ASCII pairs. These pairings don’t need to and should not be analyzed on exact byte values but the bytes should be classified more coarsely such that ASCII punctuation and parantheses, etc., are classified as space-equivalent and at least upper and lower case of a given letter are unified.
Pairs of ASCII bytes should be neutral in terms of the detection outcome apart from ISO-2022-JP detection and cases where the first ASCII byte of a pair is interpreted as a trail byte of a legacy CJK sequence. This way, the detector ignores various computer-readable syntaxes such as HTML (for deployment) and MediaWiki syntax (for training) without any syntax-specific state machine(s).
Visual Hebrew can be distinguished by observing the placement of ASCII punctuation relative to non-ASCII words.
Avoid detecting encodings that have never worked without declaration in any localization of a major browser.
There are two major observations to make of the above ideas:
The last point on the foundational idea list suggests that chardetng does not detect all encodings in the Encoding Standard. This indeed is the case. After all, the purpose of the detector is not to catch them all but to deal with the legacy that has arisen from locale-specific defaults and, to some extent, previously-deployed detectors. Most encodings in the Encoding Standard correspond to the “ANSI” code page of some Windows localization and, therefore, the default fallback in some localization of Internet Explorer. Additionally, ISO-8859-2 and ISO-8859-7 have appeared as the default fallback in non-Microsoft browsers. Some encodings require a bit more justification for why they are included or excluded.
ISO-2022-JP and EUC-JP are included, because multiple browsers have shipped on-by-default detectors over most of the existence of the Web with these as possible outcomes. Likewise, ISO-8859-5, IBM866, and KOI8-U (the Encoding Standard name for the encoding officially known as KOI8-RU) are detected, because they were detected by Gecko and are detected by IE and Chrome. The data published as part of ced also indicates that these encodings have actually been used on the Web.
ISO-8859-4 and ISO-8859-6 are included, because both IE and Chrome detect them, they have been in the menu in IE and Firefox practically forever, and the data published as part of ced indicates that they have actually been used on the Web.
ISO-8859-13 is included, because it has the same letter assignments as windows-1257, so browser support for windows-1257 has allowed ISO-8859-13 to work readably. However, disqualifying an encoding based on one encoding error could break such compatibility. To avoid breakage due to eager disqualification of windows-1257, chardetng supports ISO-8859-13 explicitly despite it not qualifying for inclusion on the usual browser legacy behavior grounds. (The non-UTF-8 glibc locales for Lithuanian and Latvian used ISO-8859-13. Solaris 2.6 used ISO-8859-4 instead.)
ISO-8859-8 is supported, because it has been in the menu in IE and Firefox practically forever, and if it wasn’t explicitly supported, the failure mode would be detection as windows-1255 with the direction of the text swapped.
KOI8-R, ISO-8859-8-I, and GB18030 are detected as KOI8-U, windows-1255, and GBK instead. KOI8-U differs from KOI8-R by assigning a couple of box drawing bytes to letters instead, so at worst the failure mode of this unification is some box drawing segments (which aren’t really used on the Web anyway) showing up as letters. windows-1255 is a superset of ISO-8859-8-I except for swapping the currency symbol. GBK and GB18030 have the same decoder in the Encoding Standard. However, chardetng makes no attempt to detect the use of GB18030/GBK for content other than Simplified Chinese despite the ability to also represent other content being a key design goal of GB18030. As far as I am aware, there is no legacy mechanism that would have allowed Web authors to rely on non-Chinese usage of GB18030 to work without declaring the encoding. (I haven’t evaluated how well Traditional Chinese encoded as GBK gets detected.)
x-user-defined as defined in the Encoding Standard (as opposed to how an encoding of the same name is defined in IE’s mlang.dll) is relevant to XMLHttpRequest but not really to HTML, so it is not a possible detection outcome.
The legacy encodings that the Encoding Standard maps to the replacement encoding are not detected. Hence, the replacement encoding is not a possible detection outcome.
UTF-16BE and UTF-16LE are not detected by chardetng. They are detected from the BOM outside chardetng. (Additionally for compatibility with IE’s U+0000 ignoring behavior as of 2009, Firefox has a hack to detect Latin1-only BOMless UTF-16BE and UTF-16LE.)
The macintosh encoding (better known as MacRoman) is not detected, because it has not been the fallback for any major browser. The usage data published as part of ced suggests that the macintosh encoding exists on the Web, but the data looks a lot like the data is descriptive of ced’s own detection results and is recording misdetection.
x-mac-cyrillic is not detected, because it isn’t detected by IE and Chrome. It was previously detected by Firefox, though.
ISO-8859-3 is not detected, because it hasn’t been the fallback for any major browser or a menu item in IE. The IE4 character encoding documentation published on the W3C’s site remarks of ISO-8859-3: “not used in the real world”. The languages for which this encoding is supposed to be relevant are Maltese and Esperanto. Despite glibc having an ISO-8859-3 locale for Maltese, the data published as part of ced doesn’t show ISO-8859-3 usage under the TLD of Malta. The ced data shows usage for Catalan, but this is likely a matter of recording misdetection, since Catalan is clearly an ISO-8859-1/windows-1252 language.
ISO-8859-10 is not detected, because it hasn’t been the fallback for any major browser or a menu item in IE. Likewise, the anachronistic (post-dating UTF-8) late additions to the series, ISO-8859-14, ISO-8859-15, and ISO-8859-16, are not detected for the same reason. Of these, ISO-8859-15 differs from windows-1252 so little and for such rare characters that detection isn’t even particularly practical, and having ISO-8859-15 detected as windows-1252 is quite tolerable.
I assigned the single-byte encodings to groups such that multiple encodings that address roughly the same character repertoire are together. Some groups only have one member. E.g. windows-1254 is alone in a group. However, the Cyrillic group has windows-1251, KOI8-U, ISO-8859-5, and IBM866.
I assigned character classes according to the character repertoire of each group. Unifying upper and lower case can be done algorithmically. Then the groupings of characters can be listed somewhere and a code generator can generate lookup tables that are indexed by byte value yielding another 8-bit number. I reserved the most significant bit of the number read from the lookup tables to indicate case for bicameral scripts. I also split the classification lookup table in the middle in order to reuse the ASCII half across the cases that share the ASCII half (non-Latin vs. non-windows-1254 Latin vs. windows-1254). windows-1254 requires a different ASCII classification in order not to treat ‘i’ and ‘I’ as a case pair. Other than that, for Latin-script encodings ASCII letters as individual classes matter and for non-Latin-script, they don’t.
I looked at the list of Wikipedias. I excluded languages with fewer than 10000-article Wikipedias as well as languages that don’t have a legacy of having been written in single-byte encodings in the above-mentioned groupings and languages whose orthography is all-ASCII or nearly all-ASCII. Then I assigned the remaining languages to the encoding groups.
I wrote a tool that ingested the Wikipedia database dumps for those languages and for each language normalized the input to Unicode Normalization Form C (just in case; I didn’t bother examining how well Wikipedia was already in NFC) and then classified each Unicode scalar value according to the classification built above such that characters not part of the mapping were considered equivalent to spaces, because ampersand and semicolon were treated as being in the same equivalence class as space, and unmappable characters would be represented as numeric character references in single-byte encodings, so the adjacencies would be with ampersand and semicolon.
The program counted the pairs, ignoring ASCII pairs, divided the count for each pair by the total (non-ASCII) pair count and divided also by class size (with a specially-picked divisor for the space-equivalent class), where class size didn’t consider case (i.e. upper and lower case didn’t count as two). Within an encoding group, the languages were merged together by taking the maximum value across languages for each character class pair. If the original count was actually zero, i.e. no occurrence of the pair in relevant Wikipedias, the output lookup table got an implausibility marker (number 255). Otherwise, the floating point results were scaled to the 0 to 254 range to turn them into relative scores that fit into a byte. The scaling was done such that the highest spikes got clipped in a way that retained reasonable value range otherwise instead of mapping the highest spike to 254. I also added manually-picked multipliers for some encoding groups. This made it possible to e.g. boost Greek a bit relative to Cyrillic, which made accuracy better for short inputs, and for long inputs ends up correcting for misdetecting Cyrillic as Greek anyway because Greek has unmapped bytes such that sufficiently long windows-1251 input ends up disqualifying the Greek encodings according to the rule that a single unmapped byte disqualifies an encoding from consideration.
Thanks to Rust compiling to very efficient code and Rayon making it easy to process as many Wikipedias in parallel as there are hardware threads, this is a quicker processing task than it may seem.
The pairs logically form a two-dimension grid of scores (and case bits for bicameral scripts) where the rows and column are the character classes participating in the pair. Since ASCII pairs don’t contribute to the score, the part of the grid that would correspond to ASCII-ASCII pairs is not actually stored.
Wikipedia uses present-day Unicode-enabled orthographies. For some languages, this is not the same as the legacy 8-bit orthography. Fortunately, with the exception of Vietnamese, going from the current orthography to the legacy orthography is a simple matter of replacing specific Unicode scalar values with other Unicode scalar values one-to-one. I made the following substitutions for training (also in upper case):
These were the languages that pretty obviously required such replacements. I did not investigate the orthography of languages whose orthography I didn’t already expect to require measures like this, so there is a small chance that some other language in the training set would have required similar substitution. I’m fairly confident that that isn’t the case, though.
For Vietnamese the legacy synthesis is a bit more complicated. windows-1258 cannot represent Vietnamese in Unicode Normalization Form C. There is a need to decompose the characters. I wrote a tiny crate that performs the decomposition and ran the training with both plausible decompositions:
I assume that the languages themselves haven’t shifted so much from the early days of the Web that the pairwise frequencies observed from Wikipedia would not work. Also, I’m assuming that encyclopedic writing doesn’t disturb the pairwise frequencies too much relative to other writing styles and topics. Intuitively this should be true for alphabetic writing, but I have no proof. (Notably, this isn’t true for Japanese. If one takes a look at the most frequent kanji in Wikipedia titles and the most frequent kanji in Wikipedia text generally, the titles are biased towards science-related kanji.)
The Latin script does not fit in an 8-bit code space (and certainly not in the ISO-style code space that wastes 32 code points for the C1 controls). Even the Latin script as used in Europe does not fit in an 8-bit code space when assuming precomposed diacritic combinations. For this reason, there are multiple Latin-script legacy encodings.
Especially after the 1990s evolution of the ISO encodings into the Windows encodings, two languages, Albanian and Estonian, are left in a weird place in terms of encoding detection. Both Albanian and Estonian can be written using windows-1252 but their default “ANSI” encoding in Windows is something different: windows-1250 (Central European) and windows-1257 (Baltic), respectively.
With Albanian, the case is pretty clear. Orthographically, Albanian is a windows-1252-compatible language, and the data (from 2007?) that Google published as part of ced shows the Albanian language and the TLD for Albania strongly associated with windows-1252 and not at all associated with either windows-1250 or ISO-8859-2. It made sense to put Albanian in the windows-1252 training set for chardetng.
Estonian is a trickier case. Estonian words that aren’t recent (from the last 100 years or so?) loans can be written using the ISO-8859-1 repertoire (the non-ASCII letters being õ, ä, ö, and ü; et_EE without further suffix in glibc is an ISO-8859-1 locale, and et was an ISO-8859-1 locale in Solaris 2.6, too). Clearly, Estonian has better detection synergy with Finnish, German, and Portuguese than with Lithuanian and Latvian. For this reason, chardetng treats Estonian as a windows-1252 language.
Although the official Estonian character reportoire is fully part of windows-1252 in the final form of windows-1252 reached in Windows 98, it wasn’t the case when Windows 95 introduced an Estonian localization of Windows and the windows-1257 Baltic encoding. At that time, windows-1252 didn’t yet have ž, which was added in Windows 98—presumably in order to match all the letter additions that ISO-8859-15 got relative to ISO-8859-1. (ISO-8859-15 got ž and š as a result of lobbying by the Finnish language regulator which insists that these letters be used for certain loans in Finnish. windows-1252 already had š in Windows 95.) While the general design principle of the window-125x series appears to be that if a character occurs in windows-1252, it is in the same position in the other windows-125x encodings that it occurs in, this principle does not apply to the placement of š and ž in windows-1257. ISO-8859-13 has the same placement of š and ž as windows-1257. ISO-8859-4 has yet different placement. (As does ISO-8859-15, which isn’t a possible detection outcome of chardetng.) The Estonian-native vowels are in the same positions in all these encodings.
The Estonian language regulator designates š and ž as part of the Estonian orthography, but these characters are rare in Estonian, since they are only used in recent loans and in transliteration. It’s completely normal to have Web pages where the entire page has neither or the page has only one occurrence of either of them. Still, they are common enough that you can find them on a major newspaper site with a few clicks. This, obviously, is problematic for encoding detection. Estonian gets detected as windows-1252. If the encoding actually was windows-1257, ISO-8859-13, ISO-8859-4, or ISO-8859-15, the (likely) lone instance of š or ž gets garbled.
It would be possible to add Estonian-specific post-processing logic to map the windows-1252 result to windows-1257, ISO-8859-13, or ISO-8859-4 (or even ISO-8859-15) if the content looks Estonian based on the Estonian non-ASCII vowels being the four most frequent non-ASCII letters and then checking which encoding is the best fit for š and ž. However, I haven’t taken the time to do this, and it would cause reloads of Web pages just to fix maybe one character per page.
The foundational ideas described above weren’t quite enough. Some refinements were needed. I wrote a test harness that synthetized input from Wikipedia titles and checked how well titles that encoded to non-ASCII were detected in such a way that they roundtripped. (In some cases, there are multiple encodings that roundtrip a string. Detecting any one of them was considered a success.) I looked at the cases that chardetng failed but ced succeeded at.
When looking at the results, it was pretty easy to figure out why a particular case failed, and it was usually pretty quick and easy to come up with and implement a new rule to address the failure mode. This iteration could be continued endlessly, but I stopped early when the result seemed competitive enough compared to ced. However, the last two items on the list were made in response to a bug report. Here are the adjustments that I made.
Some non-ASCII letters in Latin-script encodings got high pair-wise scores leading to misdetecting non-Latin as Latin. I remedied this by penalizing sequences of three non-ASCII letters in Latin encodings a little and penalizing sequences of four or more a lot. (Polish and Turkish do have some legitimate sequences of four non-ASCII letters, though.)
Treating non-ASCII punctuation and symbols as space-equivalent didn’t work, because pairing a letter with a space-like byte tends to score high but some encodings assign letters where others have symbols or punctuation. Therefore, when bytes were intended to be two letters but could be interpreted as a symbol and letter in another encoding, the latter interpretation scored higher due to the letter and space-like combination scoring higher. Such symbols and punctuation needed new non-space-equivalent character classes. I adjusted things such that no byte above 0xA0 in any single-byte encoding can get grouped together in the same character class as the ASCII space. No-break space when assigned to 0xA0 as well as Windows curly quotes and dashes when assigned below 0xA0 remain in the same character class as the ASCII space, though.
Symbols in the 0xA1 to 0xBF byte range were split further into classes that have different implausibility characteristics before or after letters (or both) or that are implausible next to characters from their own class. (Things like ® typically appear after letters but © appears before letters if next to a letter at all, and ®® and ©© are very unlikely.)
Some other characters had existence proof of occurring in pairs in Wikipedia that are in practice extremely unlikely and benefited from manually-forced implausibility. Examples include forcing implausibility of a letter after Greek final sigma, forcing implausibility of left-to-right mark and right-to-left mark next to a letter (they are supposed to be used between punctuation and space), and marking Vietnamese tones as implausible after letters other than the ones that are legitimate base characters in Vietnamese orthography. (Note that an implausibility penalty isn’t absolutely disqualifying, so long input can tolerate isolated implausibilities.)
Giving non-ASCII bicameral-script words that start with a capital letter a slight boost helped distinguish between Greek and the various non-Windows Cyrillic encodings.
Splitting the windows-1252 model into two: Icelandic and Faroese on one hand and the rest on the other. Since characters that (within the windows-1252 language set) are specific to Icelandic and Faroese are the first ones that that get replaced with something else in other Latin-script Windows and ISO encodings, not merging their use with scores for other windows-1252 languages helps keep the models for windows-1257 and windows-1254 relatively more distinctive.
I made use of the fact that present-day Korean uses ASCII spaces between words while Chinese and Japanese don’t.
Giving the same score to any Han character turned out to be a bad idea in terms of being able to distinguish legacy CJK encodings from non-Latin single-byte encodings. Fortunately, the legacy CJK encodings have a coarse frequency classification built in, and the most frequent class has nice properties relative to single-byte encodings.
JIS X 0208, GB2312, and the original Big5 have their kanji/hanzi organized into two levels roughly according to respective locales’ education systems’ classification at the time of initial standard creation. That is, Level 1 corresponds to the most frequent kanji/hanzi that the education systems prioritize. KS X 1001 instead splits into common Hangul and to hanja (very rare). This means that just by looking at the bytes, it’s possible to classify legacy CJK characters into three frequency classes: Level 1 (or Hangul), Level 2 (or hanja in the case of EUC-KR), and other (rare Hangul in the case of EUC-KR). These three can, and now are, given different scores.
Moreover, these have the fortuitous byte mapping that the Level 1 or common Hangul section in each encoding uses lower byte values for the lead byte while non-Thai, non-Arabic Windows and ISO encodings use high byte values for the common non-ASCII characters: for lower case in bicameral scripts or, in the case of Hebrew, for the consonants. This yields naturally distinctive scoring except for Thai and, to lesser extent, Arabic.
Some lead bytes in CJK encodings that overlap with windows-125x non-ASCII punctuation are problematic, because they pair with ASCII trail bytes in ways that can occur in Latin-script text without other adjacent letters that would trigger an Latin adjacency penalty. For example, without intervention, “Rock ’n Roll” could get interpreted as “Rock 地 Roll”. I made it so that the score for CJK characters with a problematic lead byte is committed only if the next character is a CJK character, too.
The differences between EUC-JP (presence of kana), EUC-KR (mainly just Hangul and with spaces between words), and GBK don’t necessarily show up in short titles. This is to be expected given the fundamental bet made in the design. Still, this made chardetng look bad relative to ced. I deviated from the plan of not having CJK frequency tables by including tables of the most frequent JIS X 0208 Level 1 Kanji, the most frequent GB2312 Level 1 Hanzi, and the most frequent KS X 1001 Hangul. I set the cutoff for “most frequent” to 128, so the resulting tables ended up being very small but still effective. (Big5 is structurally distinctive even with short inputs, so after trying including a table of the most frequent Big5 Level 1 Hanzi, I removed the table as unnecessary.)
Thai needed byte range-specific multipliers to tune it relative to GBK.
It was impractical to give score to windows-1252 ordinal indicators in the pairwise model without breaking Romanian detection. For this reason, there’s a state machine for giving score to ordinal indicator usage based on a little more context than just byte pairs taken into account. This boosted detecting accuracy especially for Italian but also for Portuguese, Castilian, Catalan, and Galician.
The byte 0xA0 is no-break space in most encodings. To avoid misdetecting an odd number of no-break spaces as IBM866 and to avoid misdetecting an even number as Chinese or Korean, 0xA0 is treated as a problematic lead for CJK purposes, and there’s a special case not to apply the score to IBM866 from certain combinations involving 0xA0.
To avoid detecting windows-1252 English as windows-1254, Latin candidates don’t count a score for a pair that involves an ASCII byte and a space-like non-ASCII byte. Otherwise, the score for Turkish dotless ı in word-final would be applied to English I’ (as in “I’ve” or “I’m”). While this would decode to the right characters and look right, it would cause an unnecessary reload in Firefox.
While the single-byte letter pair scores arose from data and the Level 1 Kanji/Hanzi was then calibrated relative to Arabic scoring (and the common Hangul score is one higher than that with a penalty for implausibly long words to distinguish from Chinese given enough input), in the interest of expediency, I assigned the rest of the scoring, including penalty values, as educated guesses rather than trying to build any kind of training framework to try to calibrate them optimally.
The general accuracy characterization relates to generic domains, such as .com. In the case of ccTLDs that actually appear to be in local use (such as .fi) as opposed to generic use (such as .tv), chardetng penalizes encodings that are less plausible for the ccTLD but are known to be confusable with the encoding(s) plausible for the ccTLD. Therefore, typical legacy content on ccTLDs is even more likely to get the right guess than legacy content on generic domains. On the flip side, the guess may be worse for atypical legacy content on ccTLDs.
Firefox asks chardetng to provide a guess twice during the loading of a page. If there is no HTTP-level character encoding declaration and no BOM, Firefox buffers up to 1024 bytes for the
<meta charset> pre-scan. Therefore, if the pre-scan fails and content-based detection is necessary, there always already is a buffer of the first 1024 bytes. chardetng makes it first guess from that buffer and the top-level domain. This doesn’t involve reloading anything, because the first 1024 byte haven’t been decoded yet.
Then, when the end of the stream is reached, chardetng guesses again. If the guess differs from the earlier guess, the page is reloaded using the new guess. By making the second guess at the end of the stream, chardetng has the maximal information available, and there is no need to estimate what portion of the stream would have been sufficient to make the guess. Also, unlike Chrome’s approach of examining the first chunk of data that the networking subsystem passes to the HTML parser, this approach does not depend on how the stream is split into buffers. If the title of the page fit into the first 1024 bytes, chances are very good that the first guess was already correct. (This is why evaluating accuracy for “title-length input” was interesting.) Encoding-related reloads are a pre-existing behavior of Gecko. A
<meta charset> that does not fit in the first 1024 bytes also triggers a reload.
In addition to the .in and .lk TLDs being exempt from chardetng (explanation why comes further down), the .jp TLD uses a Japanese-specific detector that makes its decision as soon as logically possible to decide between ISO-2022-JP, Shift_JIS and EUC-JP rather than waiting until the end of the stream.
So is it any good?
Is it more or less accurate that ced? This question does not have a simple answer. First of all, accuracy depends on the input length, and chardetng and ced scale down differently. They also scale up differently. After all, as noted, one of the fundamental bets of chardetng was that it was OK to let legacy CJK detection accuracy scale down less well in order to have binary size savings. In contrast, ced appears to have been designed to scale down well. However, ced in Chrome gets to guess once per page but chardetng in Firefox gets to revise its guess. Second, for some languages ced is more accurate and for others chardetng is more accurate. There is no right way to weigh the languages against each other to come up with a single number.
Moreover, since so many languages are windows-1252 languages, just looking at the number of languages is misleading. A detector that always guesses windows-1252 would be more accurate for a large number of languages, especially with short inputs that don’t exercise the whole character repertoire of a given language. In fact, guessing windows-1252 for pretty much anything Latin-script makes the old chardet (tested as the Rust port) look really good for windows-1252. (Other than that, both chardetng and ced are clearly so much better than ICU4C and old chardet that I will limit further discussion to chardetng vs. ced.)
There are three measures of accuracy that I think are relevant:
Given page titles that contain at least one non-ASCII character, what percentage is detected right (where “right” is defined as the bytes decoding correctly; for some bytes there may be multiple encodings that decode the bytes the same way, so any of them is “right”)?
Since Firefox first asks chardetng to guess from the first 1024 bytes, chances are that the only bit of non-ASCII content that participates in the guess is the page title.
Given documents that contain at least one non-ASCII character, what percentage is detected right (where “right” is defined as the bytes decoding correctly; for some bytes there may be multiple encodings that decode the bytes the same way, so any of them is “right”)?
Since Firefox asks chardetng to guess again at the end of the stream, it is relevant how well the detector does with full-document input.
Given the guess that is made from a full document, what prefix length, counted as number of non-ASCII bytes in the prefix, is sufficient to look at for getting the same result?
The set of languages I tested was the set of languages that have Web Platform-relevant legacy encodings, have at least 10000 articles in the language’s Wikipedia, and are not too close to being ASCII-only: an, ar, az, be, bg, br, bs, ca, ce, cs, da, de, el, es, et, eu, fa, fi, fo, fr, ga, gd, gl, he, hr, ht, hu, is, it, ja, ko, ku, lb, li, lt, lv, mk, mn, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sq, sr, sv, th, tr, uk, ur, vi (orthographically decomposed), vi (minimally decomposed), wa, yi, zh-hans, zh-hant. (The last two were algorithmically derived from the zh Wikipedia using MediaWiki’s own capabilities.)
It’s worth noting that it’s somewhat questionable to use the same data set for training and for assessing accuracy. The accuracy results would be stronger if the detector was shown accurate on a data set independent from the training data. In this sense, the results for ced are stonger than the results for chardetng. Unfortunately, it’s not simple to obtain a data set alternative to Wikipedia, which is why the same data set is used for both purposes.
If we look at Wikipedia article titles (after rejecting titles that encode to all ASCII, kana-containing titles in Chinese Wikipedia, and some really short mnemonic titles for Wikipedia meta pages), we can pick some accuracy threshold and see how many languages ced and chardetng leave below the threshold.
No matter what accuracy threshold is chosen, ced leaves more combinations of language and encoding below the threshold, but among the least accurate are Vietnamese, which is simply unsupported by ced, and ISO-8859-4 and ISO-8859-6, which are not as relevant as Windows encodings. Still, I think it’s fair to say that chardetng is overall more accurate on this title-length threshold metric, although it’s still possible to argue about the relative importance. For example, one might argue that considering the number of users it should matter more which one does better on Simplified Chinese (ced) than on Breton or Walloon (chardetng). This metric is questionable, because there are so many windows-1252 languages (some of which make it past the 10000 Wikipedia article threshold by having lots of stub articles) that a detector that always guessed windows-1252 would get a large number of languages right (old chardet is close to this characterization).
If we put the threshold at 80%, the only languages that chardetng leaves below the threshold are Latvian (61%) and Lithuanian (48%). Fixing the title-length accuracy for Latvian and Lithuanian is unlikely to be possible without a binary size penalty. ced spends 8 KB on a trigram table that improves Latvian and Lithuanian accuracy somewhat but still leaves them less accurate than most other Latin-script languages. An alternative possibility would be to have distinct models for Lithuanian and Latvian to be able to boost them individually such that the boosted models wouldn’t compete with languages that match the combination of Lithuanian and Latvian but don’t match either individually. The non-ASCII letter sets of Lithuanian and Latvian are disjoint except for č, š and ž. Anyway, as noted earlier, the detector is primarily for the benefit of non-Latin scripts, since the failure mode for Latin scripts is relatively benign. For this reason, I have not made the effort to split Lithuanian and Latvian into separate models.
The bet on trading away the capability to scale down for legacy CJK shows up the most clearly for GBK: chardetng is only 88% accurate on GBK-encoded Simplified Chinese Wikipedia titles while ced is 95% accurate. This is due to the GBK accuracy of chardetng being bad with fewer than 6 hanzi. Five syllables is still a plausible Korean word length, so the penalty for implausibly long Korean words doesn’t take effect to decide that the input is Chinese. Overall, I think failing initially for cases where the input is shorter than 6 hanzi is a reasonable price to pay for the binary size savings. After all, once the detector has seen the whole page, it can for sure correct itself and figure out the distinction between GBK and EUC-KR. (Likewise, if the 80% threshold from the previous paragraph seems bad—one in five failure rate, it’s good to remember that it just means that one in five cases needs to correct the guess after having examined the whole page.)
In the table language/encoding combinations for which chardetng is worse than ced by more than one percentage point are highlighed with bold font and tomato background. The combinations for which chardetng is worse than ced by one percentage point are highlighted with italic font and thistle background.
|vi||windows-1258 (minimally decomposed)||91%||21%||22%||19%|
If we look at Wikipedia articles themselves and filter out ones whose wikitext UTF-8 byte length is 6000 or less (arbitrary threshold to try to filter out stub articles), chardetng looks even better compared to ced in terms of how many language are left below a given accuracy threshold.
If the accuracy is rounded to full percents, ced leaves 29 language/encoding combinations at worse than 98% (i.e. 97% or lower). chardetng leaves 8. Moreover, ced leaves 22 combinations below the 89% threshold. chardetng leaves 1: Lithuanian as ISO-8859-4. That’s a pretty good result!
|et||Better of windows-1252 and windows-1257||100%||98%||98%||98%|
|vi||windows-1258 (minimally decomposed)||99%||0%||0%||0%|
I examined how truncated input (starting from 10 non-ASCII bytes and continuing by 10-byte intervals (until 100 and then my coarser intervals) but truncating by one more if the truncation would otherwise render CJK input invalid) compared to document-length input.
For legacy CJK encodings, chardetng achieves document-length-equivalent accuracy with about 10 non-ASCII bytes. For most windows-1252 and windows-1251 languages, chardetng achieves document-length-equivalent accuracy with about 20 non-ASCII bytes. Obviously, this means shorter overall input for windows-1251 than for windows-1252. At 50 non-ASCII bytes, there are very few language/encoding combinations that haven’t completely converged. Some oscillate back a little afterwards, and almost everything has settled at 90 non-ASCII bytes.
Hungarian and ISO-8859-2 Romanian are special cases that haven’t completely converged even at 1000 non-ASCII bytes.
While the title-length case showed that ced scaled down better in some cases, the advantage is lost even at 10 non-ASCII bytes. While ced has document-lengh-equivalent accuracy for the legacy CJK encodings at 10 non-ASCII bytes, the rest take significantly longer to converge than they do with chardetng.
ced had a number of windows-1252 and windows-1251 cases that converged at 20 non-ASCII bytes as with chardetng. However, it had more cases, including windows-1252 and windows-1251 cases, whose convergenge went into hundrends of non-ASCII bytes. Notably, KOI8-U as an encoding was particularly bad at converging to document-length-equilavence and for most languages (for which it is relevant) had not converged even at 1000 non-ASCII bytes.
Overall, I think it is fair to say that ced may scale down better in some cases where there are fewer than 10 non-ASCII bytes, but chardetng generally scales up better from 10 non-ASCII bytes onwards. (The threshold may be a bit under 10, but because the computation of these tests is quite slow, I did not spend time searching for the exact threshold.)
ISO-8895-7 Greek exhibited strange enough behavior with both chardetng and ced in this test that it made me suspect the testing method has some ISO-8859-7-specific problem, but I did not have time to investigate the ISO-8895-7 Greek issue.
Is it more compact than ced? As of Firefox 78, chardetng and the special-purpose Japanese encoding detector shift_or_euc contribute 62 KB to x86_64 Android libxul size, when the crates that these depend on are treated as sunk cost (i.e. Firefox already has the dependencies anyway, so they don’t count towards the added bytes). When built as part of libxul, ced contributes 226 KB to x86_64 Android libxul size. The binary size contribution on chardetng and shift_or_euc together is 28% of what the binary size contribution of ced would be. The x86 situation is similar.
|chardetng + shift_or_euc||ced|
|.text||24.6 KB||34.3 KB|
|.rodata||30.8 KB||120 KB|
|.data.rel.ro||2.52 KB||59.9 KB|
On x86_64, the goal of creating something smaller than ced worked out very well. On ARMv7 and aarch64, chardetng and shift_or_euc together result in smaller code than ced but by a less impressive factor. PGO effects ended up changing other code in unfortunate ways so much that it doesn’t make sense to give exact numbers.
chardetng is slower than ced. In single-threaded mode, chardetng takes 42% longer than ced to process the same input on Haswell.
This is not surprising, since I intentionally resolved most tradeoffs between binary size and speed in favor of smaller binary size at the expense of speed. When I resolved tradeoffs in favor of speed instead of binary size, I didn’t do so primarily for speed but for code readability. Furthermore, Firefox feeds chardetng more data than Chrome feeds to ced, so it’s pretty clear that overall Firefox spends more time in encoding detection than Chrome does.
I think optimizing for binary size rather than speed is the right tradeoff for code that only runs on legacy pages and doesn’t run at all for modern pages. (Also, microbenchmarks don’t show the cache effects on the performance of other code that likely result from ced having a larger working set of data tables than chardetng does.)
The construction of encoding detectors is that there are a number of probes that process the same data logically independently of each other. On the surface, this structure looks perfect for parallelization using one of Rust’s superpowers: being able to easily convert an iteration to use multiple worker threads using Rayon.
Unfortunately, the result in the case of chardetng is rather underwhelming even with document-length passed in as 128 KB chunks (the best case with Firefox’s networking stack when the network is fast). While Rayon makes chardetng faster in terms of wall-clock time, the result is very far from scaling linearly with the number of hardware threads. With 8 hyperthreads available on a Haswell desktop i7, the wall-clock result using Rayon is still slower than ced running on a single thread. The synchronization overhead is significant, and the overall sum of compute time across threads is inefficient compared to the single-threaded scenario. If there is a reason to expect parallelism from higher-level task division, it doesn’t make sense to enable the Rayon mode in chardetng. As of Firefox 78, the Rayon mode isn’t used in Firefox, and I don’t expect to enable the Rayon mode for Firefox.
There are some imaginable risks that testing with data synthetized from Wikipedia cannot reveal.
The approach that a single encoding error disqualifies an encoding could disqualify Big5 or EUC-KR if there’s a single private-use character that is not acknowledged by the Encoding Standard.
The Encoding Standard definition of EUC-KR does not acknowledge the Private Use Area mappings in Windows code page 949. Windows maps byte pairs with lead byte 0xC9 or 0xFE and trail byte 0xA1 through 0xFE (inclusive) to the Private Use Area (and byte 0x80 to U+0080). The use cases, especially on the Web, for this area are mostly theoretical. However, if a page somehow managed to have a PUA character like this, the detector would reject EUC-KR as a possible detection outcome.
In the case of Big5, the issue is slightly less theoretical. Big5 as defined in the Encoding Standard fills the areas that were originally for private use but that were taken by Hong Kong Supplementary Character Set with actual non-PUA mappings for the HKSCS characters. However, this still leaves a range below HKSCS, byte pairs whose lead byte is 0x81 through 0x86 (inclusive), unmapped and, therefore, treated as errors by the Encoding Standard. Big5 has had more extension activity than EUC-KR, including mutually-incompatible extensions, and the Han script gives more of a reason (than Hangul) to use the End-User-Defined Characters feature of Windows. Therefore, it is more plausible that a private-use character could find its way into a Big5-encoded Web page than into an EUC-KR-encoded Web page.
For GBK, the Encoding Standard supports the PUA mappings that Windows has. For Shift_JIS, the Encoding Standard supports two-byte PUA mappings that Windows has. (Windows also maps a few single bytes to PUA code points.) Therefore, the concern raised in this section is moot for GBK and Shift_JIS.
I am slightly uneasy that Big5 and EUC-KR in the Encoding Standard are, on the topic of (non-HKSCS) private use characters, inconsistent with Windows and with the way the Encoding Standard handles GBK and Shift_JIS. However, others have been rather opposed to adding the PUA mappings and I haven’t seen actual instances of problems in the real world, so I haven’t made a real effort to get these mappings added.
Some scripts had (single-byte) legacy encodings that didn’t make it to IE4 and, therefore, the Web Platform. These were supported by having the page decode as windows-1252 (possible via an x-user-defined declaration that meant windows-1252 decoding in IE4; the Encoding Standard x-user-defined is Mozilla’s different thing relevant to legacy XHR) and having the user install an intentionally misencoded font that assigned non-Latin glyphs to windows-1252 code points.
In some cases, there was some kind of cross-font agreement on how these were arranged. For example, for Armenian there was ARMSCII-8. (Gecko at one point implemented it as a real character encoding, but doing so was useless, because Web sites that used it didn’t declare it as such.) In other cases, these arrangements were font-specific and the relevant sites were simply popular enough (e.g. sites of newspapers in India) to be able to demand that the user install a particular font.
ced knows about a couple of font-specific encodings for Devanagari and multiple encodings, both font-specific and Tamil Nadu state standard, for Tamil. The Tamil script’s visual features make it possible to treat it as more Thai-like than as Devenagari-like, which means that Tamil is more suited for font hacks than the other Brahmic scripts of India. Unicode adopted the approach standardized by the federal government of India in 1988 to treat Tamil as Devanagari-like (logical order) whereas in 1999 the state government of Tamil Nadu sought to promote treating Tamil the way Thai is treated in Unicode (visual order).
All indications are that ced being able to detect these font hacks has nothing to do with Chrome’s needs as of 2017. It is more likely that this capability was put there for the benefit of the Google search engine more than a decade earlier. However, Chrome post-processes the detection of these encodings to windows-1252, so if sites that use these font hacks still exist, they’d appear to work in Chrome. (ced doesn’t know about Armenian, Georgian, Tajik, or Kazakh 8-bit encodings despite these having had glibc locales.)
Do such sites still exist? I don’t know. Looking at the old bugs in Bugzilla, the reported sites appear to have migrated to Unicode. This makes sense. Despite deferring the migration for years and years after Unicode was usable, chances are that the rise of mobile devices has forced migration. It’s considerably less practical to tell users of mobile operating systems to install fonts that it is to tell users of desktop operating systems to install fonts.
So chances are that such sites no longer exist (in quantity that matters), but it’s hard to tell, and if they did, they’d work in Chrome (if using an encoding that ced knows about) but wouldn’t work with chardetng in Firefox. Instead of adding support for detecting such encodings as windows-1252, I made the .in and .lk top-level domains not run chardetng and simply fall back to windows-1252. (My understanding is that the font hacks were more about the Tamil language in India specifically than about the Tamil language generally, but I turned off chardetng for .lk just in case.) This kept the previous behavior of Firefox for these two TLDs. If the problem exists on .com/.net/.org, the problem is not solved there. Also, this has the slight downside of not detecting windows-1256 to the extent it is used under .in.
chardetng detects UTF-8 by checking if the input is valid as UTF-8. However, like Chrome, Firefox only honors this detection result for file: URLs. As in Chrome, for non-file: URLs, UTF-8 is never a possible detection outcome. If it was, Web developers could start relying on it, which would make the Web Platform more brittle. (The assumption is that at this point, Web developers want to use UTF-8 for new content, so being able to rely on legacy encodings getting detected is less harmful at this point in time.) That is, the user-facing problem of unlabeled UTF-8 is deliberately left unaddressed in order to avoid more instances of problematic content getting created.
As with the Quirks mode being the default and everyone having to opt into the Standards mode, and on mobile a desktop-like view port being the default and everyone having to opt into a mobile-friendly view port, for encodings legacy is the default and everyone has to opt into UTF-8. In all these cases, the legacy content isn’t going to be changed to opt out.
The full implications of “what if” UTF-8 was detected for non-file: URLs require a whole article on their own. The reason why file: URLs are different is that the entire content can be assumed to be present up front. The problems with the detecting UTF-8 on non-file: URLs relate to supporting incremental parsing and display of HTML as it arrives over the network.
When UTF-8 is detected on non-file: URLs, chardetng reports the encoding affiliated with the top-level domain instead. Various test cases, both test cases that intentionally test this and test cases that accidentally end up testing this, require windows-1252 to be reported for generic top-level domains when the content is valid UTF-8. Reporting the TLD-affiliated encoding as opposed to always reporting windows-1252 avoids needless reloads on TLDs that are affiliated with an encoding other than windows-1252.