This post is about a UI feature that I wish no one would have to use. Happily, it is indeed almost unused. Still, I made it more usable in the case when it is used. (The change was more driven by code removal than usability, though.) Anne asked me to document the situation, so here goes.
For historical reasons, HTML can be delivered over the network using various character encodings. The browser decodes the incoming HTML data to Unicode and needs to know what encoding to use for decoding. The encoding can be declared using a byte order mark (BOM) inside the HTML file, declared using a <meta> tag inside the HTML file or declared using the Content-Type HTTP header outside the HTML file. Or the browser can encounter content whose encoding is undeclared, in which case the browser needs to guess. Traditionally, the guessing is based on the browser localization, but Firefox now tries to first guess based on the top-level domain of the URL. For some locales (Japanese, Russian and Ukrainian in Firefox), the guessing is based on the content of the file rather than the locale itself alone.
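To make the sources of encoding information concrete, here is a minimal Python sketch of how a decoder could pick between them. The function names, the parameter names and the precedence order (BOM, then HTTP header, then <meta>, then a top-level-domain guess, then a locale fallback) are assumptions for illustration; this is not Firefox's code, and content-based detection is left out entirely.

```python
from typing import Optional, Tuple

# Byte order marks and the encodings they signal.
_BOMS = (
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding signalled by a leading BOM, if there is one."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_encoding(
    data: bytes,
    http_charset: Optional[str] = None,    # from the Content-Type header
    meta_charset: Optional[str] = None,    # from a <meta> tag (hypothetical pre-scan)
    tld_guess: Optional[str] = None,       # guess from the top-level domain
    locale_fallback: str = "windows-1252", # assumed fallback for this sketch
) -> Tuple[str, str]:
    """Return (encoding, source), trying declared sources before guesses."""
    bom = sniff_bom(data)
    if bom:
        return bom, "bom"
    if http_charset:
        return http_charset, "http"
    if meta_charset:
        return meta_charset, "meta"
    if tld_guess:
        return tld_guess, "tld"
    return locale_fallback, "locale"
```

Only the last two branches are guesses, and those are the cases where a wrong answer would send a user to the Character Encoding menu.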
The Character Encoding menu allows the user to reload the document with a different encoding to be used for decoding. Specifically, the use cases are:
Content-Type HTTP header, but the contents of the HTML file (or plain text file) are being loaded from the file system, so the HTTP header is not available, and the guess made by the browser was wrong.
Sadly, telemetry shows that the second use case is now more common than the first one. On the bright side, telemetry also shows that the menu is almost entirely unused. It is unused in more than 99.9% of Firefox sessions in the locale where it is used the most (Traditional Chinese) and it is unused in more than 99.99% of Firefox sessions in most locales. In a way, it is sad to even have to improve a feature like this instead of just removing it. I hope we can at least avoid adding it to Firefox OS.
The old menu implementation was very old. It was created on September 21, 1999. Back then, RDF was still a thing at Netscape, so the menu’s data came from an RDF data source. It seems that not everyone liked RDF even back then. By November 23, 1999, the documentation for the class implementing the data source said “God, our GUI programming disgusts me.”
The general attitude back then was to support a lot of encodings even without a strongly demonstrated Web compatibility need. Also, in addition to being used in the browser, both the menu and the general encoding back end were also used for the Composer HTML editor and the Mail/News client included in the Mozilla application suite. The code base gained a lot of encodings. Some were actually needed for the Web. Some were needed for email. Some were needed for converting Unicode to non-Unicode font encodings for rendering with pre-Unicode text rendering APIs. Some (in particular EUC-TW, it seems) were added to deal with file paths and the clipboard on some Unix flavors. Some encodings seem to have gotten added without a strong use case just because a standard existed. It also happened that the same encoding was added multiple times (e.g. TIS-620 and ISO-8859-11) or with slight variations under multiple names.
The large number of encodings led to attempts to manage the number, first by organizing the encodings into submenus by region and then by alleviating the problems created by the submenus: showing the most recent choices on the top level and even providing editability (complete with a dedicated dialog!) for pinning some items to the top level.
Over time, some encodings were removed as completely useless and some encodings were removed or hidden as security problems, but overall, in the beginning of 2014, the menu was pretty much the way it was in 1999. By early 2014, Georgian GEOSTD8 had already been removed as not relevant to the Web and UTF-7, UTF-16, MacHebrew and MacArabic had been removed as cross-site scripting (XSS) hazards. Here’s the structure of the old menu from the beginning of 2014:
The old menu had so many problems that I’m not even sure where to begin. Here are some of them. The list is not necessarily exhaustive.
The number of encodings was just overwhelming.
There were multiple menu choices that actually result in identical decoding. For example, all the three items for “Thai” resulted in the same decoding behavior!
There were encodings that aren’t really used for interchange often enough to have them in the menu and that were mainly used for pre-Unicode text rendering APIs in the past. Examples include JOHAB and various Mac encodings.
Icelandic, Romanian and Croatian are more commonly encoded using a language-unspecific encoding, but language-specific items baited the user to choose the wrong item.
Email or Usenet-oriented encodings were included in the menu for the browser even when the encodings are actually dangerous in the Web context: HZ, ISO-2022-CN and ISO-2022-KR.
There were encodings, such as TCVN and VPS, that aren’t really relevant for the Web or email and that no one even bothered to register with the IANA.
There were various encodings, especially DOS encodings, that someone bothered to register with the IANA but that aren’t really relevant to interchange today.
There were latter-day ISO-8859 encodings that post-date UTF-8 and never reached broad use.
The division into West and East Europe was based on Cold War-era politics instead of geography.
Armenian and Georgian (when it was in the menu) were classified as Asian. The menu could have dodged the question of where exactly in the Caucasus region the border of Europe and Asia is, the way Wikipedia does by saying that Armenia and Georgia are “located at the crossroads of Western Asia and Eastern Europe”.
More generally, the SE & SW Asian group was an unintuitive catch-all that grouped Turkish and Thai from different ends of Asia together.
Understanding why Vietnamese and Thai weren’t found under East Asian required understanding that “CJK” forms one group in the minds of domain experts and that “East Asian” was just a UI string that meant “CJK” without exposing the jargon abbreviation.
There was an entry for User Defined, which doesn’t make sense as a user-chosen override.
The UI strings for Cyrillic encodings were inconsistent in whether they mention particular languages after a slash.
The entries were sorted in such a way that the most probable choice (Windows-125*) tends to come last.
“Nordic” and “South European” are encoding enthusiast inside baseball characterizations that do not necessarily match what users would consider Nordic or South European. You were not supposed to choose Nordic for Swedish, Danish, Norwegian, Finnish or Icelandic! The encoding is motivated by the Sami languages. You were not supposed to choose South European for e.g. Italian. It is for Maltese and Esperanto—the latter of which arguably doesn’t have a geographic affiliation.
The Auto-Detect menu had an enticing item called “Universal”, but it’s not actually universal.
The Auto-Detect menu had bewildering options for Chinese.
The Auto-Detect menu had detectors for Korean, Traditional Chinese and Simplified Chinese even though there is only one dominant legacy encoding for each!
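The point above about several menu items producing identical decoding can be illustrated with a small Python sketch. The codec names below are Python's own and only roughly correspond to the Thai menu items (that the third item was windows-874 is an assumption). Python's codec tables are also not the browser's decoders and may differ for bytes outside the Thai letter range, so this is an illustration rather than a reproduction of the old menu's behavior.

```python
# The Thai word "สวัสดี" encoded in the TIS-620 family of encodings.
THAI_BYTES = b"\xca\xc7\xd1\xca\xb4\xd5"

# Python codec names standing in for the three Thai labels (assumed mapping).
CODECS = ("tis_620", "iso8859_11", "cp874")


def decode_with_all(data: bytes) -> dict:
    """Decode the same bytes with each codec and return the results by name."""
    return {name: data.decode(name) for name in CODECS}


results = decode_with_all(THAI_BYTES)
# A set collapses identical strings, so one element means one behavior.
distinct = set(results.values())

for name, text in results.items():
    print(f"{name}: {text}")
print("distinct decodings:", len(distinct))
```

For ordinary Thai text like this, the three choices are indistinguishable to the user, which is why offering all three only added noise.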
At the end of 2014, the menu looks like this:
Clearly, the menu looks much better now. Particular things that are nice about the new menu include:
There are way fewer menu items.
The encodings are no longer spread into submenus.
When a particular adjective is associated with a single legacy encoding, there’s nothing in parentheses to bother the user. In particular, the user might know the encoding by a different name than what is the preferred name according to the Encoding Standard. For example, for “Western”, the user might be more familiar with the name ISO-8859-1 than the name windows-1252. This is a non-issue when there’s nothing in parentheses.
When there is a Windows-125* encoding and an ISO-8859-* encoding for an adjective, the string in parentheses is just “Windows” or “ISO” to avoid bothering the user with the numbers.
The most common choices, UTF-8 (labeled “Unicode”) and windows-1252 (labeled “Western”) are at the top of the menu and the rest of the items come alphabetically.
However, the text in parentheses is sorted reverse-alphabetically so that Windows and Shift_JIS come before ISO, which puts the more likely choice first. (There seems to be a bug that puts “Hebrew, Visual” before “Hebrew”, though.)
The menu is more keyboard-accessible (except on the Mac, of course, which is hostile to keyboard users) thanks to all the important items having access keys.
The Auto-Detect menu offers fewer detectors. The not-really-universal Universal is no longer there as an attractive nuisance. Since each of Korean, Simplified Chinese and Traditional Chinese now has only one item in the menu anyway, there’s no point in having detectors for them.
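The ordering rule described in the list above (two pinned items first, the rest alphabetical by adjective, and the parenthesized part reverse-alphabetical within an adjective) can be sketched in Python as follows. The item list is a small hypothetical subset, and the names `PINNED`, `order_menu` and `label` are made up for this sketch; it is not the actual Firefox implementation.

```python
from typing import List, Optional, Tuple

# (adjective, text shown in parentheses or None)
Item = Tuple[str, Optional[str]]

# The two most common choices stay at the top, in this order.
PINNED = ["Unicode", "Western"]


def order_menu(items: List[Item]) -> List[Item]:
    """Pinned items first, then adjective A-Z with qualifiers Z-A."""
    pinned = [item for name in PINNED for item in items if item[0] == name]
    rest = [item for item in items if item[0] not in PINNED]
    # Two stable passes: the secondary key (qualifier, descending) goes first,
    # then the primary key (adjective, ascending).
    rest.sort(key=lambda item: item[1] or "", reverse=True)
    rest.sort(key=lambda item: item[0])
    return pinned + rest


def label(item: Item) -> str:
    """Render an item the way it would read in the menu."""
    adjective, qualifier = item
    return f"{adjective} ({qualifier})" if qualifier else adjective


# Hypothetical subset of menu items, deliberately out of order.
SAMPLE: List[Item] = [
    ("Greek", "ISO"),
    ("Western", None),
    ("Arabic", "ISO"),
    ("Greek", "Windows"),
    ("Unicode", None),
    ("Korean", None),
    ("Arabic", "Windows"),
]

for item in order_menu(SAMPLE):
    print(label(item))
```

Sorting in two stable passes keeps the rule readable: the second sort decides the overall order, and the first one only breaks ties within an adjective.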
I took the following steps to come up with the new menu:
Got rid of RDF.
Got rid of the submenus (except for detectors, which I wish users used even less than the rest of the menu).
Removed all the detectors that were not on by default for any localization.
Removed all the encodings that are not part of the Encoding Standard.
Removed all the encodings that are not in the corresponding menu in IE11. The reasoning is two-fold. First, Web authors who omit declarations out of laziness can’t really be relying on the user fixing the page from the menu in the case of encodings that are not in the menu in another major browser. Second, hiding the encodings that are part of the Encoding Standard but that are not in the menu in IE11 gets rid of a bunch of encodings that are either hard to label in a way that users would understand or that would cause some of the adjectives to end up with multiple encodings that would be difficult to disambiguate.
Got rid of x-user-defined.
Put only GBK in the menu as “Chinese, Simplified”, since the two Simplified Chinese encodings that are left, GBK and GB18030, both use the same decoder—the decoder for GB18030. GBK is more conservative on the encoder side, so it’s a safer choice for a manual override.
Put only windows-1255 as “Hebrew” in the menu, because ISO-8859-8-I differs from it by one character, which is a currency sign. There’s no way a user will want to distinguish between the two when doing a manual override.
Moved the two overwhelmingly most common items to the top, set apart from the rest by a separator.
Sorted the rest as described above.
Gave access keys to all Windows encodings, all encodings that are the default for some locale and all Japanese encodings.
I’m not quite happy with the menu. In particular, I suspect that some “(ISO)” entries might be pretty useless, specifically the ones for Arabic, Baltic, Cyrillic and Greek. The Greek one is actually the fallback encoding used by the Greek Firefox localization and also the Greek Chrome localization, but it’s possible that this is a legacy arising from anti-Microsoft sentiment that doesn’t actually have much to do with the legacy content out there. The differences between Windows and ISO Greek are so small that chances are that guessing the ISO encoding works well enough with Windows-encoded legacy content, but guessing the Windows encoding and hiding the ISO encoding would be even more successful. In the case of the Arabic, Baltic and Cyrillic ISO encodings, I doubt that they are used often enough that it’s worthwhile to have them in the menu, considering that readers of Arabic, Baltic or Cyrillic text might waste time choosing the wrong option. Research into these matters would be appreciated.
Also, I am uncomfortable with having ISO-2022-JP in the menu. It has a structure that looks like an XSS hazard on its face. However, it has leaked from email to the Web, so it has some usage, and I have neither been able to develop nor seen anyone else develop a proof-of-concept attack using it. If you want to get it out of the menu, the best bet is to show a proof-of-concept attack.
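As a rough illustration of what “structure” means here, the Python sketch below shows that ISO-2022-JP is stateful: after an escape sequence, bytes that would be markup characters in ASCII are consumed as two-byte characters instead. It also shows the reverse situation with UTF-7, which was removed from the menu as an XSS hazard, where harmless-looking ASCII bytes decode to markup. This uses Python's codecs rather than a browser's decoders, the variable names are made up, and it is not a proof-of-concept attack.

```python
# ISO-2022-JP escape sequences that switch the decoder's state.
ESC_TO_JIS = b"\x1b$B"    # switch to JIS X 0208 (two bytes per character)
ESC_TO_ASCII = b"\x1b(B"  # switch back to ASCII

payload = b"<b>x"  # looks like markup when read as ASCII

# Read as plain ASCII, the bytes are markup characters.
as_ascii = payload.decode("ascii")

# Between the escape sequences, the same four bytes are consumed as
# two double-byte characters, so no "<" or ">" survives decoding.
as_iso2022jp = (ESC_TO_JIS + payload + ESC_TO_ASCII).decode("iso2022_jp")

# UTF-7 works the other way around: these ASCII-only bytes contain no
# "<" or ">" but decode to markup.
utf7_bytes = b"+ADw-b+AD4-"
as_utf7 = utf7_bytes.decode("utf-7")

print(repr(as_ascii))
print(len(as_iso2022jp), "characters, contains '<':", "<" in as_iso2022jp)
print(repr(as_utf7))
```

In both cases the meaning of a byte depends on the decoder chosen, so a filter that inspects the bytes under one encoding and a browser that decodes them under another can disagree about where the markup is.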
See also a sequel.