A Look at Encoding Detection and Encoding Menu Telemetry from Firefox 86

Firefox gained a way to trigger chardetng from the Text Encoding menu in Firefox 86. In this post, I examine both telemetry from Firefox 86 related to the Text Encoding menu and telemetry related to chardetng running automatically (without the menu).

The questions I’d like to answer are:

Can we replace the Text Encoding menu with a single menu item that performs the function currently performed by the item Automatic in the Text Encoding menu?
Does chardetng have to revise its guess often? (That is, is the guess made at one kilobyte typically the same as the guess made at the end of the stream? If not, there’s a reload.)
Does the top-level domain affect the guess often? (If yes, maybe it’s worthwhile to tune this area.)
Is unlabeled UTF-8 so common as to warrant further action to support it?
Is the unlabeled UTF-8 situation different enough for text/html and text/plain to warrant different treatment of text/plain?

The failure mode of decoding according to the wrong encoding is very different for the Latin script and for non-Latin scripts. Also, there are historical differences in UTF-8 adoption and encoding labeling in different language contexts. For example, UTF-8 adoption happened sooner for the Arabic script and for Vietnamese while Web developers in Poland and Japan had different attitudes towards encoding labeling early on. For this reason, it’s not enough to look at the global aggregation of data alone.

Since Firefox’s encoding behavior no longer depends on the UI locale and a substantial number of users use the en-US localization in non-U.S. contexts, I use geographic location rather than the UI locale as a proxy for the legacy encoding family of the Web content primarily being read.

The geographical breakdown of telemetry is presented in the tables by ISO 3166-1 alpha-2 code. The code is deduced from the source IP addresses of the telemetry submissions at the time of ingestion after which the IP address itself is discarded. As another point relevant to privacy, the measurements below referring to the .jp, .in, and .lk TLDs are not an indication of URL collection. The split into four coarse categories, .jp, .in+.lk, other ccTLD, and non-ccTLD, was done on the client side as a side effect of these four TLD categories getting technically different detection treatment: .jp has a dedicated detector, .in and .lk don’t run detection at all, for other ccTLDs the TLD is one signal taken into account, and for other TLDs the detection is based on the content only. (It’s imaginable that there could be regional differences in how willing users are to participate in telemetry collection, but I don’t know if there actually are regional differences.)
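As a rough illustration of the four-way split described above, here is a minimal sketch in Python. It is not the Firefox client code. The function name `tld_category` is hypothetical, and treating any two-letter ASCII label as a ccTLD is a simplifying assumption (internationalized ccTLDs, for example, are not handled).

```python
def tld_category(hostname: str) -> str:
    """Bucket a hostname into one of the four coarse TLD categories."""
    # Take the last dot-separated label, ignoring a trailing root dot.
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld == "jp":
        return ".jp"  # dedicated Japanese detector
    if tld in ("in", "lk"):
        return ".in+.lk"  # no detection at all
    # Assumption: a two-letter ASCII label is treated as a ccTLD.
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return "other ccTLD"  # TLD is one signal among others
    return "non-ccTLD"  # detection from content only


for host in ("example.jp", "example.co.in", "example.hu", "example.com"):
    print(host, "->", tld_category(host))
```

Only the resulting category label, never the hostname, would need to leave the client in a scheme like this.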

Menu Usage

Starting with 86, Firefox has a probe that measures if the item “Automatic” in the Text Encoding menu has been used at least once in a given subsession. It also has another probe measuring whether any of the other (manual) items in the Text Encoding menu has been used at least once in a given subsession.

Both the manual selection and the automatic selection are used at the highest rate in Japan. The places with the next-highest usage rates are Hong Kong and Taiwan. The manual selection is still used in more sessions than the automatic selection. In Japan and Hong Kong, the factor is less than 2. In Taiwan, it’s less than 3. In places where the dominant script is the Cyrillic script, manual selection is relatively even more popular. This is understandable, considering that the automatic option is a new piece of UI that users probably haven’t gotten used to, yet.

All in all, the menu is used rarely relative to the total number of subsessions, but I assume the usage rate in Japan still makes the menu worth keeping considering how speedy feedback from Japan is whenever I break something in this area. Even though the menu usage seems very rare, with a large number of users, a notable number of users daily still find the need to use the menu.

Japan is a special case, though, since we have a dedicated detector that runs on the .jp TLD. The menu usage rates in Hong Kong and Taiwan are pretty close to the rate in Japan, though.

In retrospect, it’s unfortunate that the new probes for menu usage frequency can’t be directly compared with the old probe, because we now have distinct probes for the automatic option being used at least once per subsession and a manual option being used at least once per subsession, and both a manual option and the automatic option could be used in the same Firefox subsession. We can calculate changes assuming the extreme cases: the case where the automatic option is always used in a subsession together with a manual option and the case where they are always used in distinct subsessions. This gives us worst-case and best-case percentages of the 86 menu use rate compared to the 71 menu use rate. (E.g. 89% means that the menu was used 11% less in 86 than in 71.) The table is sorted by the relative frequency of use of the automatic option in Firefox 86. The table is not exhaustive. It is filtered both objectively, to exclude rows with a low number of distinct telemetry submitters, and semi-subjectively, to exclude encoding-wise similar places or places whose results seemed noisy. Also, Germany, India, and Italy are included as counter-examples of places that are notably apart from the others in terms of menu usage frequency, with India also being treated specially encoding-wise.

	Worst case	Best case
JP	89%	58%
TW	63%	46%
HK	61%	40%
CN	80%	54%
TH	82%	66%
KR	72%	53%
UA	206%	167%
BG	112%	99%
RU	112%	82%
SG	59%	46%
GR	91%	69%
IL	92%	80%
IQ	24%	13%
TN	15%	10%
EE	63%	43%
TR	102%	61%
HU	109%	77%
LV	88%	72%
LT	67%	53%
EG	39%	28%
VN	41%	35%
DE	90%	65%
IN	108%	77%
IT	83%	55%
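The two extreme cases behind the table above can be sketched as follows. This is my reading of the calculation, not the actual analysis script: the function names are hypothetical and the input rates in the example are made up for illustration. If the two probes always fire in distinct subsessions, the combined rate is their sum (worst case). If they always fire together, the combined rate is just the larger of the two (best case).

```python
def menu_use_bounds(auto_86, manual_86, menu_71):
    """Return (worst, best) Firefox 86 menu use relative to Firefox 71."""
    # Distinct subsessions: every automatic use adds to the manual uses.
    worst = (auto_86 + manual_86) / menu_71
    # Same subsessions: the less-used item adds nothing to the total.
    best = max(auto_86, manual_86) / menu_71
    return worst, best


def manual_to_auto_factor(worst, best):
    """Recover manual/automatic usage from a table row (manual >= automatic)."""
    return best / (worst - best)


# Hypothetical rates per 100,000 subsessions, for illustration only.
worst, best = menu_use_bounds(auto_86=20, manual_86=30, menu_71=50)
print(f"worst {worst:.0%}, best {best:.0%}")

# Percentages from the JP, HK, and TW rows of the table above.
for place, (w, b) in {"JP": (89, 58), "HK": (61, 40), "TW": (63, 46)}.items():
    print(place, round(manual_to_auto_factor(w, b), 1))
```

Under this reading, the gap between the worst and best case in a row is the automatic option’s share, so the JP and HK rows imply a manual-to-automatic factor of about 1.9 and the TW row about 2.7, which matches the factors stated earlier.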

The result is a bit concerning. According to the best case numbers, things got better everywhere except in Ukraine. The worst case numbers suggest that things might have gotten worse also in other places where the Cyrillic script is the dominant script as well as in Turkey and Hungary where the dominant legacy encoding is known to be tricky to distinguish from windows-1252, and in India, whose domestic ccTLD is excluded from autodetection. Still, the numbers for Russia, Hungary, Turkey, and India look like things might have stayed the same or gotten a bit better.

At least in the case of the Turkish and Hungarian languages, the misdetection of the encoding is going to be another Latin-script encoding anyway, so the result is not catastrophic in terms of user experience. You can still figure out what the text is meant to say. For any non-Latin script, including the Cyrillic script, misdetection makes the page completely unreadable. In that sense, the numbers for Ukraine are concerning.

In the case of India, the domestic ccTLD, .in, is excluded from autodetection and simply falls back to windows-1252 like it used to. Therefore, for users in India, the added autodetection applies only on other TLDs, including to content published from within India on generic TLDs. We can’t really conclude anything in particular about changes to the browser user experience in India itself. However, we can observe that, Ukraine excepted, in the other places where the worst case was over 100%, the worst case was in the same ballpark as the worst case for India, where the worst case may not be meaningful. So maybe the other similar worst-case results don’t really indicate things getting substantially worse.

To understand how much menu usage in Ukraine has previously changed from version to version, I looked at the old numbers from Firefox 69, 70, 71, 74, 75, and 76. chardetng landed in Firefox 73 and settled down by Firefox 78. The old telemetry probe expired, which is why we don’t have data from Firefox 85 to compare with.

	69	70	71	74	75	76
69	100%	87%	70%	75%	75%	73%
70	115%	100%	81%	87%	86%	83%
71	143%	124%	100%	107%	106%	103%
74	133%	115%	93%	100%	99%	96%
75	134%	117%	94%	101%	100%	97%
76	138%	120%	97%	104%	103%	100%

In the table, the percentage in the cell is the usage rate in the version from the column relative to the version from the row. E.g. in version 70, the usage was 87% of the usage in version 69 and, therefore, decreased by 13%.
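To make the reading of the table concrete, here is a small sketch that rebuilds the matrix from a single row. The only inputs are the already-rounded percentages relative to Firefox 69 from the first row, so recomputed cells can differ from the published ones by a percentage point. The names are hypothetical.

```python
VERSIONS = [69, 70, 71, 74, 75, 76]

# Usage relative to Firefox 69, taken from the first row of the table above.
RELATIVE_TO_69 = {69: 1.00, 70: 0.87, 71: 0.70, 74: 0.75, 75: 0.75, 76: 0.73}


def relative_usage(row_version, column_version):
    """Usage in the column version as a fraction of usage in the row version."""
    return RELATIVE_TO_69[column_version] / RELATIVE_TO_69[row_version]


# Print the matrix in the same row/column layout as the table.
print("\t" + "\t".join(str(v) for v in VERSIONS))
for row in VERSIONS:
    cells = "\t".join(f"{relative_usage(row, col):.0%}" for col in VERSIONS)
    print(f"{row}\t{cells}")
```

Each cell is the reciprocal of its mirror image across the diagonal, which is why the lower-left half of the table carries no extra information.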

This does make even the best-case change from 71 to 86 for Ukraine look like a possible signal and not noise. However, the change from 71 to 74, 75, and 76, representing the original landing of chardetng, was substantially milder. Furthermore, the difference between 69 and 71 was larger, which suggests that the fluctuation between versions may be rather large.

It’s worth noting that with the legacy encoded data synthesized from the Ukrainian Wikipedia, chardetng is 100% accurate with document-length inputs and 98% accurate with title-length inputs. This suggests that the problem might be something that cannot be remedied by tweaking chardetng. Boosting Ukrainian detection without a non-Wikipedia corpus to evaluate with would risk breaking Greek detection (the other non-Latin bicameral script) without any clear metric of how much to boost Ukrainian detection.

Menu Usage Situation

Let’s look at the situation in which the menu (either the automatic option or a manual option) was used. This is recorded relative to the top-level page, so this may be misleading if the content that motivated the user to use the menu was actually in a frame.

First, let’s describe the situations. Note that Firefox 86 did not honor bogo-XML declarations in text/html, so documents whose only label was in a bogo-XML declaration count as unlabeled.

ManuallyOverridden: The encoding was already manually overridden. That is, the user was unhappy with their previous manual choice. This gives an idea of how users need to iterate with manual choices.
AutoOverridden: The encoding was already overridden with the automatic option. This suggests that either chardetng guessed wrong or the problem that the user is seeing cannot be remedied by the encoding menu. (E.g. UTF-8 content misdecoded as windows-1252 and then re-encoded as UTF-8 cannot be remedied by any choice from the menu.)
UnlabeledNonUtf8TLD: Unlabeled non-UTF-8 content containing non-ASCII was loaded from a ccTLD other than .jp, .in, or .lk, and the TLD influenced chardetng’s decision. That is, the same bytes served from a .com domain would have been detected differently.
UnlabeledNonUtf8: Unlabeled non-UTF-8 content containing non-ASCII was loaded from a TLD other than .jp, .in, or .lk, and the TLD did not influence chardetng’s decision. (The TLD may have been either a ccTLD that didn’t end up contributing to the decision or a generic TLD.)
LocalUnlabeled: Unlabeled non-UTF-8 content from a file: URL.
UnlabeledAscii: Unlabeled (remote; i.e. non-file:) content that was fully ASCII, excluding the .jp, .in, and .lk TLDs. This indicates that either the problem the user attempted to remedy was in a frame or was a problem that the menu cannot remedy.
UnlabeledInLk: Unlabeled content (ASCII, UTF-8, or ASCII-compatible legacy) from either the .in or .lk TLDs.
UnlabeledJp: Unlabeled content (ASCII, UTF-8, or ASCII-compatible legacy) from the .jp TLD. The .jp-specific detector, which detects among the three Japanese legacy encodings, ran.
UnlabeledUtf8: Unlabeled content (outside the .jp, .in, and .lk TLDs) that was actually UTF-8 but was not automatically decoded as UTF-8 to avoid making the Web Platform more brittle. We know that there is an encoding problem for sure and we know that choosing either “Automatic” or “Unicode” from the menu resolves it.
ChannelNonUtf8: An ASCII-compatible legacy encoding or ISO-2022-JP was declared on the HTTP layer.
ChannelUtf8: UTF-8 was declared on the HTTP layer but the content wasn’t valid UTF-8. (The menu is disabled if the top-level page is declared as UTF-8 and is valid UTF-8.)
MetaNonUtf8: An ASCII-compatible legacy encoding or ISO-2022-JP was declared in meta (in the non-file: case).
MetaUtf8: UTF-8 was declared in meta (in the non-file: case) but the content wasn’t valid UTF-8. (The menu is disabled if the top-level page is declared as UTF-8 and is valid UTF-8.)
LocalLabeled: An encoding was declared in meta in a document loaded from a file: URL and the actual content wasn’t valid UTF-8. (The menu is disabled if the top-level page is declared as UTF-8 and is valid UTF-8.)
Bug: A none-of-the-above situation that was not supposed to happen and, therefore, is a bug in how I set up the telemetry collection.
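The double-encoding example mentioned under AutoOverridden can be reproduced with Python’s built-in codecs. This is only a sketch: the candidate list is an illustrative subset rather than the actual contents of the Text Encoding menu, and Python’s `cp1252` codec is used as a stand-in for windows-1252.

```python
original = "é"

# UTF-8 bytes misdecoded as windows-1252, then re-encoded as UTF-8.
double_encoded = original.encode("utf-8").decode("cp1252").encode("utf-8")

# Illustrative subset of encodings a user might pick from a menu.
CANDIDATES = ["utf-8", "cp1252", "cp1250", "cp1251", "cp1253",
              "shift_jis", "big5", "gbk", "euc_kr"]


def single_decode_attempts(data):
    """Decode the same bytes once with each candidate encoding."""
    return {enc: data.decode(enc, errors="replace") for enc in CANDIDATES}


attempts = single_decode_attempts(double_encoded)
recovered_by_menu = [enc for enc, text in attempts.items() if text == original]
print("single-step recoveries:", recovered_by_menu)

# Undoing the damage takes two steps, which no single menu choice performs.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print("two-step repair:", repaired)
```

A menu item only changes how the bytes are decoded once, so content damaged this way on the authoring side looks broken under every choice.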

	ManuallyOverridden	AutoOverridden	UnlabeledNonUtf8TLD	UnlabeledNonUtf8	LocalUnlabeled	UnlabeledAscii	UnlabeledInLk	UnlabeledJp	UnlabeledUtf8	ChannelNonUtf8	ChannelUtf8	MetaNonUtf8	MetaUtf8	LocalLabeled	Bug
Global	8.7%	2.3%	0.3%	2.6%	2.1%	6.7%	0.4%	6.3%	30.0%	12.6%	16.4%	4.4%	0.8%	4.8%	1.6%
JP	6.5%	2.7%	0.1%	3.5%	1.4%	4.7%	0.7%	22.7%	15.8%	9.5%	19.7%	3.0%	0.7%	6.5%	2.4%
HK	15.9%	5.5%	0.5%	2.9%	4.8%	6.0%	0.0%	0.0%	34.8%	7.1%	14.0%	4.2%	0.7%	1.3%	2.3%
TW	14.2%	4.4%	0.4%	2.1%	6.3%	7.9%	0.0%	0.1%	30.2%	7.9%	16.6%	4.8%	1.0%	3.3%	1.0%
CN	7.0%	1.7%	0.4%	2.0%	0.9%	5.9%	0.0%	0.0%	56.8%	7.2%	7.5%	4.4%	0.9%	2.6%	2.4%
TH	7.9%	3.1%	0.6%	1.6%	2.3%	9.3%	0.0%	0.4%	17.7%	25.8%	15.8%	10.3%	1.0%	3.5%	0.7%
KR	8.8%	3.1%	0.1%	1.2%	3.2%	6.7%	0.6%	0.0%	39.7%	11.6%	15.8%	3.2%	1.1%	3.2%	1.8%
UA	11.5%	2.3%	0.6%	0.4%	2.0%	7.7%	0.0%	0.0%	32.9%	14.8%	17.0%	2.9%	0.0%	6.7%	1.3%
BG	8.1%	2.8%	0.0%	2.0%	2.4%	4.9%	0.0%	0.0%	22.9%	14.8%	26.9%	4.5%	0.0%	3.4%	7.3%
RU	11.1%	1.3%	0.4%	1.2%	1.6%	3.8%	0.0%	0.0%	33.3%	21.3%	17.1%	1.6%	0.4%	6.0%	0.8%
BY	10.9%	1.2%	1.6%	1.4%	0.4%	4.5%	0.0%	0.0%	27.8%	23.6%	15.1%	5.1%	1.5%	6.2%	0.8%
SG	12.5%	3.2%	0.0%	1.6%	6.9%	7.5%	0.0%	0.0%	38.1%	13.1%	12.3%	2.5%	0.0%	1.7%	0.6%
GR	14.6%	1.5%	0.3%	2.7%	8.3%	6.1%	0.0%	0.0%	25.5%	7.4%	22.6%	3.0%	0.9%	6.3%	0.9%
IL	16.7%	2.0%	0.0%	1.2%	4.5%	16.5%	0.0%	0.0%	24.7%	13.3%	14.1%	4.8%	0.0%	2.4%	0.0%
BR	5.6%	2.5%	0.3%	1.8%	0.3%	4.3%	0.0%	0.0%	7.1%	38.7%	26.1%	5.9%	0.7%	5.6%	1.0%
HU	9.0%	2.4%	1.0%	2.4%	1.6%	3.4%	0.0%	0.0%	26.8%	4.6%	28.9%	6.7%	4.6%	5.9%	2.8%
CZ	10.0%	3.8%	0.0%	1.1%	3.0%	3.2%	0.0%	0.0%	25.5%	11.3%	27.3%	3.2%	1.3%	9.4%	0.9%
DE	8.3%	2.9%	0.4%	2.2%	1.8%	5.6%	0.0%	0.2%	17.8%	18.9%	24.5%	8.5%	1.5%	5.2%	2.2%
IN	7.2%	2.0%	0.0%	0.6%	1.8%	7.6%	12.7%	0.0%	6.7%	40.6%	5.2%	9.2%	0.4%	3.4%	2.6%

The cases AutoOverridden, UnlabeledNonUtf8TLD, UnlabeledNonUtf8, and LocalUnlabeled represent cases that are suggestive of chardetng having been wrong (or the user misdiagnosing the situation). These cases together are in the minority relative to the other cases. Notably, their total share is very near the share of UnlabeledAscii, which is probably more indicative of how often users misdiagnose what they see as remediable via the Text Encoding menu than of sites using frames. However, I have no proof either way of whether this represents misdiagnosis by the user more often or frames more often. In any case, having potential detector errors be in the same ballpark as cases where the top-level page is actually all-ASCII is a sign of the detector probably being pretty good.
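For the Global row of the table above, the comparison works out as follows. This is just the arithmetic on the published percentages, with hypothetical variable names.

```python
# Percentages from the Global row of the menu usage situation table.
GLOBAL_SHARES = {
    "AutoOverridden": 2.3,
    "UnlabeledNonUtf8TLD": 0.3,
    "UnlabeledNonUtf8": 2.6,
    "LocalUnlabeled": 2.1,
    "UnlabeledAscii": 6.7,
}

# Situations suggestive of a detector error (or user misdiagnosis).
SUSPECTED = ("AutoOverridden", "UnlabeledNonUtf8TLD",
             "UnlabeledNonUtf8", "LocalUnlabeled")

possible_detector_errors = round(sum(GLOBAL_SHARES[n] for n in SUSPECTED), 1)
print(f"possible detector errors: {possible_detector_errors}%")
print(f"all-ASCII top-level page: {GLOBAL_SHARES['UnlabeledAscii']}%")
```

The four suspected-error situations add up to 7.3% globally, against 6.7% for UnlabeledAscii.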

The UnlabeledAscii number for Israel stands out. I have no idea why. Are frames more common there? Is it a common pattern to programmatically convert content to numeric character references? If the input to such conversion has been previously misdecoded, the result looks like an encoding error to the user but cannot be remedied from the menu.

Globally, the dominant case is UnlabeledUtf8. This is sad in the sense that we could automatically fix this case for users if there wasn’t a feedback loop to Web author behavior. See a separate write-up on this topic. Also, this metric stands out for mainland China. We’ll also come back to other metrics related to unlabeled UTF-8 standing out in the case of mainland China.

Mislabeled content is a very substantial reason for overriding the encoding. For the ChannelNonUtf8, MetaNonUtf8, and LocalLabeled cases, the label was either actually wrong or the user misdiagnosed the situation. For the UnlabeledUtf8 and MetaUtf8 cases, we can be very confident that there was an actual authoring-side error. Unsurprisingly, overriding an encoding labeled on the HTTP layer is much more common than overriding the encoding labeled within the file. This supports the notion that Ruby’s Postulate is correct.

Note that the number for UnlabeledJp in Japan does not indicate that the dedicated Japanese detector is broken. The number could represent unlabeled UTF-8 on the .jp TLD, since the .jp TLD is excluded from the other columns.

The relatively high numbers for ManuallyOverridden indicate that users are rather bad at figuring out on the first attempt what they should choose from the menu. When chardetng would guess right, not giving users the manual option would be a usability improvement. However, in cases where nothing in the menu solves the problem, there’s a cohort of users who are unhappy about software deciding for them that there is no solution and are happier coming to that conclusion manually. For them, an objective usability improvement could feel patronizing. Obviously, when chardetng would guess wrong, not providing manual recourse would make things substantially worse.

It’s unclear what one should conclude from the AutoOverridden and LocalUnlabeled numbers. They can represent cases where chardetng actually guesses wrong, or they could represent cases where the manual items don’t provide a remedy, either. E.g. none of the menu items remedies UTF-8 having been decoded as windows-1252 and the result having been encoded as UTF-8. The higher numbers for Hong Kong and Taiwan look like a signal of a problem. Because mainland China and Singapore don’t show a similar issue, it’s more likely that the signal for Hong Kong and Taiwan is about Big5 rather than GBK. I find this strange, because Big5 should be structurally distinctive enough for the guess to be right if there is an entire document of data to make the decision from. One possibility is that Big5 extensions, such as Big5-UAO, whose character allocations the Encoding Standard treats as unmapped, are more common in legacy content than previously thought. Even one such extension character causes chardetng to reject the document as not Big5. I have previously identified this as a potential risk. Also, it is strange that LocalUnlabeled is notably higher than the global number also for Singapore, Greece, and Israel, but these don’t show a similar difference on the AutoOverridden side.

The Bug category is concerningly high. What have I missed when writing the collection code? Also, how is it so much higher in Bulgaria?

Non-Menu Detector Outcomes

Next, let’s look at non-menu detection scenarios: What’s the relative frequency of non-file: non-menu non-ASCII chardetng outcomes? (Note that this excludes the .jp, .in, and .lk TLDs. .jp runs a dedicated detector instead of chardetng and no detector runs on .in and .lk.)

Here are the outcomes (note that ASCII-only outcomes are excluded):

UtfInitial: The detector knew that the content was UTF-8 and the decision was made from the first kilobyte. (However, a known-wrong TLD-affiliated legacy encoding was used instead in order to avoid making the Web Platform more brittle.)
UtfFinal: The detector knew that the content was UTF-8, but the first kilobyte was not enough to decide. That is, the first kilobyte was ASCII. (However, a known-wrong TLD-affiliated legacy encoding was used instead in order to avoid making the Web Platform more brittle.)
TldInitial: The content was non-UTF-8 and the decision was affected by the ccTLD. That is, the same bytes on .com would have been decided differently. The decision that was made once the first kilobyte was seen remained the same when the whole content was seen.
TldFinal: The content was non-UTF-8 and the decision was affected by the ccTLD. That is, the same bytes on .com would have been decided differently. The guess that was made once the first kilobyte was seen differed from the eventual decision that was made when the whole content had been seen.
ContentInitial: The content was non-UTF-8 on a ccTLD, but the decision was not affected by the TLD. That is, the same content on .com would have been decided the same way. The decision that was made once the first kilobyte was seen remained the same when the whole content was seen.
ContentFinal: The content was non-UTF-8 on a ccTLD, but the decision was not affected by the TLD. That is, the same content on .com would have been decided the same way. The guess that was made once the first kilobyte was seen differed from the eventual decision that was made when the whole content had been seen.
GenericInitial: The content was non-UTF-8 on a generic TLD. The decision that was made once the first kilobyte was seen remained the same when the whole content was seen.
GenericFinal: The content was non-UTF-8 on a generic TLD. The guess that was made once the first kilobyte was seen differed from the eventual decision that was made when the whole content had been seen.
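The eight outcomes above boil down to three yes/no facts plus the kind of TLD. The following sketch shows that mapping. It is not the telemetry collection code, and `TldKind`, `classify_outcome`, and the parameter names are hypothetical.

```python
from enum import Enum


class TldKind(Enum):
    CCTLD = "ccTLD"      # a ccTLD other than .jp, .in, or .lk
    GENERIC = "generic"  # any non-ccTLD


def classify_outcome(is_utf8, settled_at_first_kb, tld_kind, tld_affected=False):
    """Map the facts about one non-ASCII detection to an outcome label.

    settled_at_first_kb: the decision made from the first kilobyte was
    also the final decision (no revision at the end of the stream).
    tld_affected: the same bytes on .com would have been decided differently.
    """
    suffix = "Initial" if settled_at_first_kb else "Final"
    if is_utf8:
        return "Utf" + suffix
    if tld_kind is TldKind.GENERIC:
        return "Generic" + suffix
    return ("Tld" if tld_affected else "Content") + suffix


print(classify_outcome(True, False, TldKind.GENERIC))
print(classify_outcome(False, True, TldKind.CCTLD, tld_affected=True))
print(classify_outcome(False, False, TldKind.CCTLD))
```

Every label ending in Final corresponds to a case where the first-kilobyte guess did not hold.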

The rows are grouped by the most detection-relevant legacy encoding family (e.g. Singapore is grouped according to Simplified Chinese) sorted by Windows code page number and the rows within a group are sorted by the ISO 3166 code. The places selected for display are either exhaustive exemplars of a given legacy encoding family or, when not exhaustive, either large-population exemplars or detection-wise remarkable cases. (E.g. Icelandic is detection-wise remarkable, which is why Iceland is shown.)

text/html

		UtfInitial	UtfFinal	TldInitial	TldFinal	ContentInitial	ContentFinal	GenericInitial	GenericFinal
Global		12.7%	66.6%	1.0%	0.0%	9.3%	0.1%	9.0%	1.3%
Thai	TH	17.0%	68.2%	0.4%	0.0%	5.8%	0.1%	7.0%	1.5%
Japanese	JP	13.0%	72.4%	0.0%	0.0%	0.8%	0.0%	13.1%	0.5%
Simplified Chinese	CN	13.7%	17.3%	0.2%	0.0%	7.0%	0.1%	61.1%	0.6%
Simplified Chinese	SG	14.7%	69.5%	0.9%	0.0%	1.8%	0.3%	11.2%	1.6%
Korean	KR	23.8%	30.2%	0.4%	0.0%	22.2%	0.1%	21.6%	1.8%
Traditional Chinese	HK	13.5%	56.3%	0.5%	0.0%	3.6%	0.1%	24.4%	1.6%
	MO	27.9%	46.5%	0.4%	0.0%	2.8%	0.0%	21.4%	0.9%
	TW	9.3%	75.8%	0.3%	0.0%	6.3%	0.1%	7.7%	0.5%
Central European	CZ	12.6%	49.6%	0.7%	0.0%	33.6%	0.1%	2.5%	0.9%
	HU	15.1%	48.0%	18.4%	0.3%	1.4%	1.2%	13.4%	2.2%
	PL	15.8%	72.5%	3.7%	0.1%	3.1%	0.4%	3.0%	1.5%
	SK	23.6%	61.5%	1.2%	0.0%	8.7%	0.1%	3.7%	1.2%
Cyrillic	BG	9.7%	81.8%	0.4%	0.0%	2.3%	0.1%	4.4%	1.4%
	RU	6.2%	91.0%	0.1%	0.0%	1.7%	0.0%	0.8%	0.2%
	UA	6.4%	86.0%	0.2%	0.0%	4.0%	0.1%	2.6%	0.6%
Western	BR	22.9%	44.8%	2.9%	0.0%	26.7%	0.0%	2.3%	0.4%
	CA	19.6%	61.4%	0.8%	0.0%	3.9%	0.0%	11.3%	2.9%
	DE	14.0%	65.6%	0.5%	0.0%	15.1%	0.0%	4.1%	0.7%
	ES	4.2%	75.1%	1.4%	0.0%	7.3%	0.0%	11.1%	0.9%
	FR	6.4%	70.5%	0.3%	0.0%	14.5%	0.0%	7.7%	0.6%
	GB	10.5%	84.5%	0.7%	0.0%	1.1%	0.0%	2.2%	0.9%
	IS	46.2%	39.3%	0.3%	0.0%	5.5%	0.0%	7.8%	0.8%
	IT	8.8%	73.1%	0.6%	0.0%	11.3%	0.1%	5.3%	1.0%
	US	12.8%	72.1%	0.4%	0.0%	1.4%	0.0%	10.5%	2.7%
Greek	GR	12.0%	71.4%	5.8%	0.0%	2.3%	0.8%	4.5%	3.2%
Turkic	AZ	7.3%	86.3%	0.3%	0.0%	1.8%	0.1%	3.2%	1.1%
Turkic	TR	19.5%	59.4%	1.6%	0.0%	8.8%	0.2%	7.9%	2.6%
Hebrew	IL	6.9%	79.9%	0.6%	0.0%	6.8%	0.1%	4.3%	1.5%
Arabic-script	EG	5.5%	75.9%	0.3%	0.0%	1.3%	0.1%	5.0%	11.8%
	PK	2.2%	86.4%	4.0%	1.6%	0.8%	0.1%	3.4%	1.4%
	SA	9.1%	80.2%	0.6%	0.0%	1.2%	0.1%	4.8%	4.1%
Baltic	EE	21.6%	67.2%	0.4%	0.0%	6.5%	0.1%	3.2%	1.0%
	LT	48.6%	47.1%	0.8%	0.1%	1.3%	0.1%	1.4%	0.6%
	LV	6.4%	87.2%	0.4%	0.0%	3.0%	0.1%	2.1%	0.7%
Vietnamese	VN	19.7%	67.4%	1.1%	0.0%	1.5%	0.2%	7.9%	2.1%
Other	AM	7.9%	85.2%	0.4%	0.0%	2.5%	0.0%	3.0%	0.9%
	ET	2.8%	85.2%	1.1%	0.0%	2.6%	0.1%	6.2%	2.0%
	GE	10.7%	82.9%	0.3%	0.0%	1.8%	0.1%	3.2%	1.0%
	IN	11.6%	69.7%	0.6%	0.0%	1.5%	0.1%	12.6%	3.9%
	LK	3.4%	89.6%	0.2%	0.0%	0.5%	0.1%	4.7%	1.4%

text/plain

		UtfInitial	UtfFinal	TldInitial	TldFinal	ContentInitial	ContentFinal	GenericInitial	GenericFinal
Global		15.8%	71.5%	0.6%	0.0%	4.1%	0.2%	6.9%	0.8%
Thai	TH	12.2%	54.9%	5.6%	0.0%	3.6%	0.1%	22.1%	1.6%
Japanese	JP	15.7%	28.7%	0.1%	0.0%	1.5%	0.1%	51.6%	2.4%
Simplified Chinese	CN	14.1%	70.6%	1.1%	0.1%	2.7%	0.1%	10.8%	0.6%
Simplified Chinese	SG	10.3%	73.7%	0.7%	0.0%	1.9%	0.1%	12.0%	1.2%
Korean	KR	2.8%	5.1%	0.3%	0.1%	89.8%	0.0%	1.7%	0.2%
Traditional Chinese	HK	14.0%	70.6%	0.5%	0.1%	2.7%	0.1%	10.8%	1.2%
	MO	13.8%	69.7%	0.6%	0.0%	3.4%	0.0%	12.6%	0.0%
	TW	20.4%	45.6%	3.9%	0.1%	11.8%	0.1%	16.7%	1.5%
Central European	CZ	25.7%	69.7%	0.9%	0.1%	1.3%	0.0%	2.1%	0.2%
	HU	19.9%	53.8%	12.8%	0.2%	2.5%	1.0%	8.4%	1.5%
	PL	28.5%	61.4%	2.2%	0.1%	2.3%	0.3%	4.6%	0.6%
	SK	28.6%	46.4%	3.2%	0.1%	8.1%	0.5%	11.2%	1.8%
Cyrillic	BG	14.4%	47.2%	1.8%	0.3%	17.5%	0.1%	17.0%	1.8%
	RU	25.8%	58.2%	2.5%	0.0%	4.1%	0.3%	8.1%	1.0%
	UA	22.6%	46.4%	3.1%	0.0%	6.9%	0.1%	19.3%	1.5%
Western	BR	21.6%	53.8%	0.6%	0.0%	15.0%	0.2%	6.9%	1.9%
	CA	75.3%	20.9%	0.1%	0.0%	0.6%	0.0%	2.7%	0.5%
	DE	13.8%	62.4%	0.3%	0.0%	13.6%	1.0%	7.8%	1.1%
	ES	17.5%	60.3%	0.4%	0.0%	5.9%	0.2%	14.6%	1.1%
	FR	24.2%	61.5%	0.2%	0.0%	4.7%	0.1%	8.5%	0.7%
	GB	2.2%	92.5%	0.1%	0.0%	1.6%	0.1%	3.0%	0.5%
	IS	13.2%	65.7%	0.5%	0.0%	11.6%	0.0%	8.1%	0.8%
	IT	9.7%	73.6%	0.5%	0.0%	7.2%	0.2%	7.9%	0.9%
	US	6.0%	83.5%	0.1%	0.0%	1.0%	0.3%	7.7%	1.3%
Greek	GR	25.6%	52.9%	6.9%	0.1%	1.6%	1.3%	10.1%	1.4%
Turkic	AZ	17.6%	58.3%	1.6%	0.5%	2.6%	0.0%	18.6%	0.9%
Turkic	TR	7.3%	80.7%	1.0%	0.0%	1.7%	0.0%	8.2%	1.1%
Hebrew	IL	14.1%	67.0%	1.6%	0.1%	1.7%	0.1%	13.0%	2.4%
Arabic-script	EG	13.1%	47.1%	1.8%	0.0%	1.5%	0.3%	33.8%	2.4%
	PK	10.2%	68.2%	2.2%	0.0%	1.2%	0.1%	16.5%	1.7%
	SA	14.7%	58.7%	16.5%	0.0%	1.2%	0.1%	7.5%	1.4%
Baltic	EE	49.9%	37.7%	0.4%	0.0%	3.7%	0.0%	7.1%	1.1%
	LT	26.5%	59.9%	3.7%	0.2%	1.9%	0.1%	6.7%	0.9%
	LV	15.2%	58.2%	9.8%	0.2%	3.5%	0.2%	11.6%	1.4%
Vietnamese	VN	12.8%	60.5%	2.1%	0.2%	1.7%	0.5%	20.9%	1.3%
Other	AM	16.6%	59.7%	0.7%	0.0%	5.9%	0.2%	15.6%	1.2%
	ET	15.2%	61.6%	0.3%	0.0%	3.9%	0.3%	15.2%	3.5%
	GE	12.6%	56.8%	1.4%	0.0%	14.5%	0.0%	12.4%	2.3%
	IN	9.6%	67.7%	0.3%	0.0%	1.4%	0.1%	18.8%	2.2%
	LK	9.0%	63.3%	0.2%	0.0%	1.2%	0.0%	23.1%	3.2%

Observations

Recall that for Japan, India, and Sri Lanka, the domestic ccTLDs (.jp, .in, and .lk, respectively) don’t run chardetng, and the tables above cover only chardetng outcomes. Armenia, Ethiopia, and Georgia are included as examples where, despite chardetng running on the domestic ccTLD, the primary domestic script has no Web Platform-supported legacy encoding.

When the content is not actually UTF-8, the decision is almost always made from the first kilobyte. We can conclude that chardetng doesn’t reload too much.

GenericFinal for HTML in Egypt is the notable exception. We know from testing with synthetic data that chardetng doesn’t perform well for short inputs of windows-1256. This looks like a real-world confirmation.
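To put rough numbers on the two observations above, the share of non-UTF-8 detections whose first-kilobyte guess was later revised can be computed from the text/html table. The sketch below uses the Global and Egypt rows as published. The function name is hypothetical.

```python
# Non-UTF-8 columns of the text/html table, in table order:
# (TldInitial, TldFinal, ContentInitial, ContentFinal,
#  GenericInitial, GenericFinal)
TEXT_HTML = {
    "Global": (1.0, 0.0, 9.3, 0.1, 9.0, 1.3),
    "EG": (0.3, 0.0, 1.3, 0.1, 5.0, 11.8),
}


def revised_share(row):
    """Fraction of non-UTF-8 outcomes where the initial guess was revised."""
    finals = row[1] + row[3] + row[5]
    return finals / sum(row)


for place, row in TEXT_HTML.items():
    print(f"{place}: {revised_share(row):.1%} of non-UTF-8 guesses revised")
```

Globally, about 7% of the non-UTF-8 guesses were revised, while for Egypt the figure is roughly 64%.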

The TLD seems to have the most effect in Hungary, which is unsurprising, because it’s hard to make the detector detect Hungarian from the content every time without causing misdetection of other Latin-script encodings.

The most surprising thing in these results is that unlabeled UTF-8 is encountered relatively more commonly than unlabeled legacy encodings, and that it is so often detected only after the first kilobyte. If this content was mostly in the primary language of the places listed in the table, UTF-8 should be detected from the first kilobyte. I even re-checked the telemetry collection code on this point to see that the collection works as expected.

Yet, the result of most unlabeled UTF-8 HTML being detected after the first kilobyte repeats all over the world. The notably different case that stands out is mainland China, where the total of unlabeled UTF-8 is lower than elsewhere even if the late detection is still a bit more common than early detection. Since the phenomenon occurs in places where the primary script is not the Latin script but mainland China is different, my current guess is that unlabeled UTF-8 might be dominated by an ad network that operates globally with the exception of mainland China. This result could be caused by ads that have more than a kilobyte of ASCII code and a copyright notice at the end of the file. (Same-origin iframes inherit the encoding from their parent instead of running chardetng. Different-origin iframes, such as ads, could be represented in these numbers, though.)
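The hypothesized ad-like shape, more than a kilobyte of ASCII followed by a little non-ASCII at the end, can be illustrated with a simplified check. This is a sketch under stated assumptions, not chardetng’s logic: it treats “decided from the first kilobyte” as “the first 1024 bytes contain non-ASCII”, and the sample inputs are made up.

```python
FIRST_CHUNK = 1024  # bytes examined for the initial guess


def utf8_outcome(data: bytes):
    """Return 'UtfInitial', 'UtfFinal', or None if not non-ASCII UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8
    if data.isascii():
        return None  # ASCII-only outcomes are excluded from the tables
    # Late detection: the first kilobyte was pure ASCII.
    return "UtfFinal" if data[:FIRST_CHUNK].isascii() else "UtfInitial"


# Hypothetical ad-like payload: 1100 bytes of ASCII code, then a notice.
ad_like = b"var x = 1;\n" * 100 + "/* \u00a9 Example Ads */".encode("utf-8")
# Hypothetical local-language text in a non-Latin script.
local_text = "Привет, мир! ".encode("utf-8") * 100

print(utf8_outcome(ad_like))
print(utf8_outcome(local_text))
```

Content in a non-Latin script lands in UtfInitial almost by construction, which is why the dominance of UtfFinal in places with non-Latin primary scripts is surprising.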

I think the next step is to limit these probes to top-level navigations only to avoid the participation of ad iframes in these numbers.

Curiously, the late-detected unlabeled UTF-8 phenomenon extends to plain text, too. Advertising doesn’t plausibly explain plain text. This suggests that plain-text loads are dominated by something other than local-language textual content. To the extent scripts and stylesheets are viewed as documents that are navigated to, one would expect copyright legends to typically appear at the top. Could plain text be dominated by mostly-ASCII English regardless of where in the world users are? The text/plain UTF-8 result for the United Kingdom looks exactly like one would expect for English. But why is the UTF-8 text/plain situation so different from everywhere else in South Korea?

Conclusions

Let’s go back to the questions:

Can We Replace the Text Encoding Menu with a Single Menu Item?

Most likely yes, but before doing so, it’s probably a good idea to make chardetng tolerate Big5 byte pairs that conform to the Big5 byte pattern but that are unmapped in terms of the Encoding Standard.
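What “conforming to the Big5 byte pattern” could mean is sketched below as a purely structural check that ignores whether a byte pair is mapped. The lead and trail byte ranges are the ones used by the Encoding Standard’s Big5 decoder. The function itself is a hypothetical illustration, not chardetng code.

```python
def conforms_to_big5_pattern(data: bytes) -> bool:
    """Check Big5 structure only, ignoring whether pairs are mapped."""
    i = 0
    while i < len(data):
        byte = data[i]
        if byte <= 0x7F:
            i += 1  # ASCII byte stands alone
            continue
        if not 0x81 <= byte <= 0xFE:
            return False  # 0x80 and 0xFF are never lead bytes
        if i + 1 >= len(data):
            return False  # lead byte without a trail byte
        trail = data[i + 1]
        if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
            return False  # trail byte outside the Big5 ranges
        i += 2
    return True


print(conforms_to_big5_pattern("中文".encode("big5")))
print(conforms_to_big5_pattern(b"\x81\x40"))  # structurally fine pair
print(conforms_to_big5_pattern(b"\xa4\x7f"))  # 0x7F is not a trail byte
```

A detector that tolerated such pairs would presumably still want to penalize them relative to mapped characters, but how much is an open tuning question.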

Replacing the Text Encoding menu would probably improve usability considering how the telemetry suggests that users are bad at making the right choice from the menu and bad at diagnosing whether the problem they are seeing can be addressed by the menu. (If the menu had only the one item, we’d be able to disable the menu more often, since we’d be able to better conclude ahead of time that it won’t have an effect.)

Does chardetng have to revise its guess often?

No. For legacy encodings, one kilobyte is most often enough. It’s not worthwhile to make adjustments here.

Does the Top-Level Domain Affect the Guess Often?

It affects the results often in Hungary, which is expected, but not otherwise. Even though the TLD-based adjustments to detection are embarrassingly ad hoc, the result seems to work well enough that it doesn’t make sense to put effort into tuning this area better.

Is Unlabeled UTF-8 So Common as to Warrant Further Action to Support It?

There is a lot of unlabeled UTF-8 encountered relative to unlabeled non-UTF-8, but the unlabeled UTF-8 doesn’t appear to be normal text in the local language. In particular, the early vs. late detection telemetry doesn’t vary in the expected way when the primary local language is near-ASCII-only and when the primary local language uses a non-Latin script.

More understanding is needed before drawing more conclusions.

Is the Unlabeled UTF-8 Situation Different Enough for text/html and text/plain to Warrant Different Treatment of text/plain?

More understanding is needed before drawing conclusions. The text/plain and text/html cases look strangely similar even though the text/plain cases are unlikely to be explainable as advertising iframes.

Action items

Keep an eye on the menu usage telemetry for Ukraine over the next releases.
Limit the (non-menu) text/html and text/plain outcome probes to top-level navigations only. Then take another look at the UTF-8 issues.
Make the Big5 detector tolerate byte pairs that conform to the Big5 pattern but that are unmapped in the Encoding Standard.
- Bonus: Consider tolerating, for detection purposes only, other legacy CJK extensions that diverge from the Encoding Standard (unclear if these are worthwhile):
  - Windows EUDC in EUC-KR
  - JIS X 0213 extensions to the NEC row in both EUC-JP and Shift_JIS
  - MacJapanese in Shift_JIS
  - MacKorean in EUC-KR
  - Mac System 7 single-byte extensions to GBK
  - Mac System 7 single-byte extensions to Big5 (those that don’t overlap with HKSCS and EUDC)
Proceed with replacing the Text Encoding menu with its “Automatic” item (renamed to “Override Text Encoding”).