In early 2019, I found myself in a situation where I needed to check that I hadn’t broken IME integration code. Later in 2019, I needed to do it again and now I'm testing this again in 2020, so I’m writing this down.
This is “Did I break things?” smoke testing advice for software developers who don’t themselves use an IME daily or for IME users who need to also test other IMEs that they don’t themselves use regularly. This is not a guide to building things with IME APIs. Also, obviously, writing one word with each IME isn’t the same level of testing as actually writing a lot of text using an IME daily. Once you’ve checked that things aren’t totally broken, you should get your code tested by actual daily users of various IMEs.
IME stands for Input Method Editor, which is an old Windows term. However, these days IME is colloquially used regardless of operating system. An IME is a piece of software that transforms user-generated input events (mostly keyboard events, but some IMEs allow some auxiliary pointing device interaction) into text in a manner more complex than a mere keyboard layout. Basically, if the relationship between the keys that a user presses on a hardware keyboard and the text that ends up in an application’s text buffer is more complex than when writing French, an IME is in use.
Notably, this is a matter of complexity of the mapping from input events into text in memory. This is not a matter of complexity of the mapping from memory to display. In particular, the mapping from keys to Unicode scalar values for e.g. Arabic is less complex than for French even though the display is more complex.
The above definition is incomplete without defining the capabilities of a keyboard layout, so this digression to keyboard layouts seems necessary for completeness. On the basic level, a keyboard layout provides a mapping from key codes to Unicode scalar values with modifier keys like shift and alt gr (option on Mac) taken into account.
On the basic level, if the modifiers shift and alt gr (option on Mac) are allowed (whether more modifiers are technically possible is outside the scope of this article), the non-modifier keys get four possible mappings to Unicode scalar values: no modifier, with shift, with alt gr/option, and with both shift and alt gr/option. On Windows (since XP) and Gtk, the mapping indeed is to Unicode scalar values rather than to UTF-16 code units. On macOS, key strokes can generate Unicode strings, but those strings are typically single-character strings.
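To make the distinction between Unicode scalar values and UTF-16 code units concrete, here is a minimal Python sketch. The character is an arbitrary astral-plane example that I picked for illustration; it is not tied to any particular keyboard layout.

```python
# A character outside the Basic Multilingual Plane is ONE Unicode scalar
# value but TWO UTF-16 code units (a surrogate pair).
ch = "\U00010437"  # arbitrary astral-plane character, chosen only for illustration

scalar_count = len(ch)                          # Python strings count code points
utf16_units = len(ch.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes

print(scalar_count, utf16_units)
```

A layout format that maps a key to a single UTF-16 code unit could not express such a character with one key press, whereas a mapping to scalar values can.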
The previous sentences were qualified by “on the basic level”. There is an added complication: dead keys. A dead key represents an accent such that pressing the key does not yet produce the accent, but the next key press produces an accented letter. Note that this is not just about swapping the key press order of a base character and a combining accent. By convention, the output is a single precomposed Unicode scalar value as opposed to the output being the base character followed by a combining accent.
For example, to type ô on a French AZERTY keyboard, you first press the key whose US QWERTY keycap says [, which is a dead key for circumflex in French AZERTY, and then you press the key o (the same key as on US QWERTY). The output after pressing the second key is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX (i.e. precomposed in Normalization Form C).
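The difference between the precomposed output and a base character followed by a combining accent can be checked with Python's standard unicodedata module. This is only an illustration of the two representations, not of how any dead key mechanism is implemented.

```python
import unicodedata

precomposed = "\u00f4"  # what the dead key sequence outputs
decomposed = "o\u0302"  # base letter followed by COMBINING CIRCUMFLEX ACCENT

# The two strings look the same when rendered but differ in memory.
nfc = unicodedata.normalize("NFC", decomposed)   # composes to U+00F4
nfd = unicodedata.normalize("NFD", precomposed)  # decomposes to o + U+0302

print(unicodedata.name(precomposed))
print(nfc == precomposed, nfd == decomposed, len(precomposed), len(decomposed))
```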
For historical reasons, Win32 and Gtk don’t treat the dead key mechanism as an IME even though logically it’s a tiny IME. On macOS, however, dead keys act like a tiny IME that’s driven by the declarative keyboard layout data as opposed to being bespoke for each language whose keyboard layouts use dead keys.
Keyboard layouts are a sufficient abstraction for most scripts, including ones whose display is considered complex.
On Windows, macOS, and Fedora, IMEs are activated just like keyboard layouts. Fedora and macOS install the IMEs by default, and you add an IME to the input source selector the way you add a keyboard layout.
On macOS, this is done in the “Input Sources” tab of the “Keyboard” pane of “System Preferences”. There is no distinction in the selector between keyboard layouts and IMEs. All the ones with color icons are keyboard layouts, but the ones with grayscale icons can be keyboard layouts or IMEs.
On Fedora, this is done in the “Input Sources” section of the “Region & Language” pane of “Settings”. IMEs are distinguished from keyboard layouts by a two-gear icon to the right of their name.
On Ubuntu, the selector is otherwise the same as on Fedora, but IMEs don’t show up unless the corresponding language has been added first via the “Manage Installed Languages” button that is below the “Input Sources” section. Note that adding a language can install a set of fonts that are relevant to the language, so the combination of languages you have added is fingerprintable from the Web. Note that thanks to a Gnome bug, some input sources may be hidden unless a related glibc locale has been generated (even if not taken into use) on the system, so you should first run the following command:
sudo locale-gen zh_TW.UTF-8
(You can find this and other details on PinyinJoe.com.)
On Windows 10, both IMEs and keyboard layouts become available via adding a language to the system. This is done via the Add a language button in the “Region & Language” pane of “Windows Settings”. In the process, uncheck the box that offers to set the newly-added language as your Windows display language unless you want the UI language to change! As in the case of Ubuntu, adding a language to Windows may install additional language-relevant fonts, which makes the combination of languages you’ve added fingerprintable from the Web.
Beyond Fedora and Ubuntu (and, presumably, distros based on them), chances are that you are going to have a bad time with other distros. I timed out trying to figure out how to enable any IME on a U.S. English openSUSE installation. However, installing openSUSE in Japanese enabled a Japanese IBus-based IME, so I don’t expect openSUSE to provide any testing insight beyond Fedora and Ubuntu. Debian (Debian 9 at least) leaves setting up an IME as an exercise to the user even if the user installs the system selecting an IME-requiring language!
Firefox telemetry shows that most Linux IME users use the IBus IME framework (all Nightly Linux IME users use IBus) but some use fcitx. I timed out trying to figure out how one would end up with an fcitx configuration by default, so I didn’t test fcitx. Later, I read that at least at some point Ubuntu Kylin defaulted to fcitx, but I didn’t verify this. (It seems to me that at this point, avoiding IBus is like avoiding PulseAudio and systemd.)
Note that Windows won’t let you add more than one Traditional Chinese regional variant and one Simplified Chinese regional variant at a time. I don’t know what effect the region has in practice beyond which IME is offered as the default. Notably, choosing Hong Kong doesn’t reveal a Jyutping (phonetic Cantonese) IME. To enable non-default IMEs for Traditional Chinese and Simplified Chinese, click the language in the “Region & Language” pane, click “Options”, and click the “Add a keyboard” button.
Windows 10 simplified the input source selection UI. Especially if you install IMEs to Windows by other means or if you want to trigger Hanja conversion by mouse, you need to enable the legacy Language bar feature to get the old more complex input source selection UI. To enable the old Language bar, go to the “Typing” subpane of the “Devices” pane in “Windows Settings”, click “Advanced keyboard settings” and check the box “Use the desktop language bar when it’s available”.
Note that even though on all systems you switch between keyboard layouts and IMEs the same way (Gnome and macOS have a menu in the top right area in the menubar and Windows has one in the bottom right area in the task bar), typically the CJK IMEs internally have a mode switch between a mode called either “English” or “Direct Input” (acting like a QWERTY keyboard layout) and the language that the IME is for. The “English” / “Direct Input” mode is typically denoted either by “A” or “英” in the UI, so that’s not the mode you want, but that’s the piece of UI to look for and click. The Chinese mode is typically denoted as “中”, the main (Hiragana plus Kanji) Japanese mode as “あ”, and the Korean mode as “한”. (In the context of the Microsoft Korean IME, the “漢” button is for Hanja conversion, so that’s not the button needed for this step.)
As far as I can tell, IMEs fall into three categories: ones that address a character repertoire that is too large to fit into the keyboard layout abstraction, one that moves display-time complexity to input time, and ones that are a matter of preference.
Hangul, the Korean script, has alphabet/syllabary duality. Logically, the script consists of alphabetic letters called jamo. However, each syllable is grouped into a block that occupies a square of the size of a Chinese character.
The jamo level of Hangul fits into a keyboard layout. The possible syllables don’t. For modern Hangul, given a word (separated by spaces in present-day Korean) consisting of a valid jamo sequence, the grouping into syllables is unambiguous.
If one considers how e.g. Brahmic scripts are rendered using contextual rendering-time glyph selection (shaping), one might expect a similar mechanism to work for modern Hangul. However, due to the relative timelines of IME and shaping technology development as well as wishing to fit archaic Hangul into the same general approach as modern Hangul in text storage, the syllable grouping is made explicit in text storage.
It is then the job of a Hangul IME to group jamo produced by a keystroke per alphabetic unit into explicitly-stored syllables. Since the grouping is unambiguous for modern Hangul, there’s no need for UI for explicitly guiding the grouping.
Let’s try typing 서울 (Seoul; this word has a syllable that ends in a vowel sound and another syllable that starts with a vowel sound, which is interesting as we’ll see below).
To figure out which key corresponds to which jamo, refer to a picture of the most common layout (known as 두벌식 / Dubeolsik / 2-Bulsik / 2-set Korean) on Wikimedia Commons.
US QWERTY keys typed | Jamo keys typed | Output | Notes |
---|---|---|---|
t | ㅅ | ㅅ | Not yet a valid syllable. |
tj | ㅅㅓ | 서 | First syllable is complete. Or is it? |
tjd | ㅅㅓㅇ | 성 | The first jamo of the second syllable is a plausible third jamo for the first syllable but isn’t a valid syllable alone. |
tjdn | ㅅㅓㅇㅜ | 서우 | The fourth jamo isn’t valid unless the third jamo becomes the first jamo of the second syllable. |
tjdnf | ㅅㅓㅇㅜㄹ | 서울 | The second syllable is complete. |
Each modern syllable starts with one consonant followed by one vowel optionally followed by one or two consonants, where ㅇ as the first consonant is a silent placeholder for when there is phonetically no leading consonant and jamo that are visually (not phonetically) double-consonants (e.g. ㅆ is visually a double form of ㅅ typed by pressing shift while pressing ㅅ) are analyzed as a single consonant for the purpose of the rule, keystrokes, and Unicode. As seen above, after entering one consonant and one vowel, the syllable is both plausibly complete and plausibly incomplete. A following consonant can become the third jamo in the syllable. However, if a vowel follows that consonant, they have to form a new syllable.
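The regrouping in the table above can be sketched as a small state machine in Python. This is a deliberately simplified illustration, not how any shipping IME is implemented: the names type_keys and compose are mine, only unshifted 2-set keys are mapped, and compound vowels (such as ㅘ) and two-consonant finals are not handled. The code point arithmetic is the standard Unicode formula for precomposed Hangul syllables.

```python
# Unshifted keys of the 2-set layout mapped to compatibility jamo.
KEYMAP = dict(zip("qwertyuiopasdfghjklzxcvbnm",
                  "ㅂㅈㄷㄱㅅㅛㅕㅑㅐㅔㅁㄴㅇㄹㅎㅗㅓㅏㅣㅋㅌㅊㅍㅠㅜㅡ"))

# Jamo in the order used by the Unicode Hangul syllable arithmetic.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                # 19 leading consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"           # 21 vowels
TAILS = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"  # index 0 = no trailing consonant


def compose(lead, vowel, tail):
    """Return the precomposed syllable, or the lone jamo if incomplete."""
    if lead and vowel:
        tail_index = TAILS.index(tail) if tail else 0
        return chr(0xAC00 + (LEADS.index(lead) * 21 + VOWELS.index(vowel)) * 28
                   + tail_index)
    return lead or vowel


def type_keys(keys):
    """Simulate typing QWERTY keys and return the resulting text."""
    blocks = []  # each block is [lead, vowel, tail]
    for key in keys:
        jamo = KEYMAP[key]
        cur = blocks[-1] if blocks else None
        if jamo in VOWELS:
            if cur and cur[0] and not cur[1]:
                cur[1] = jamo  # consonant + vowel
            elif cur and cur[2]:
                # The trailing consonant moves over to start a new syllable.
                lead, cur[2] = cur[2], None
                blocks.append([lead, jamo, None])
            else:
                blocks.append([None, jamo, None])
        elif cur and cur[0] and cur[1] and not cur[2] and jamo in TAILS:
            cur[2] = jamo  # plausible third jamo of the current syllable
        else:
            blocks.append([jamo, None, None])
    return "".join(compose(*block) for block in blocks)


steps = [type_keys("tjdnf"[:i]) for i in range(1, 6)]
print(steps)
```

Feeding the prefixes of tjdnf through the sketch reproduces the Output column of the table, including the step where ㅇ migrates from the end of the first syllable to the start of the second.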
Microsoft also ships an Old Hangul IME for writing archaic Hangul. I’m not covering it here.
The issue that there are more characters than fit on a keyboard arises with the Han, Yi, and Ge’ez scripts.
Microsoft bundles Ge’ez-script IMEs for Amharic and Tigrinya. Fedora and Ubuntu appear to ship only an Amharic IME, but a Web search suggests it might be suitable for also writing Tigrinya. macOS does not appear to come with Ge’ez text input support, but e.g. SIL has a third-party IME (which I didn’t test) that supports macOS and that supports more Ge’ez-script languages than the Windows and Fedora-bundled IMEs.
The Ge’ez script encodes a consonant, which may be glottal stop, and a vowel as one character such that the dominant shape of the character denotes the consonant and then the shape acquires smaller changes to denote the vowel. The consonants fit on a keyboard and so do the vowels. They map roughly to QWERTY keys with similar Latin-script phonetic values. To type a character, you type a vowel, which generates a character with a glottal stop as the consonant, or the consonant and, if the default vowel isn’t the right one, also a vowel.
Let’s try typing አማርኛ, the name of the Amharic language in the language itself. Wikipedia says the romanization is Amarəñña. Then it’s easy to guess the keys.
US QWERTY keys typed | Output |
---|---|
a | አ |
am | አም |
ama | አማ |
amar | አማር |
amarN | አማርኝ |
amarNa | አማርኛ |
This particular example works the same on Gnome and Windows, but there appear to be some differences for some of the key mappings.
So how is this different from Hangul? In both cases, you type alphabetic keystrokes and get characters in the text buffer that group those keystrokes. In Hangul, the jamo have standalone notation, the jamo appear graphically identifiably in the clusters, and the clusters can be decomposed into their component jamo within Unicode. The Ge’ez characters do not decompose in Unicode or graphically even though they can be considered to decompose phonetically and in terms of keystrokes.
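The Unicode-level difference can be observed with Python's unicodedata module: canonical decomposition (NFD) splits a Hangul syllable into conjoining jamo but leaves a Ge’ez character untouched. The two sample characters are taken from the examples above.

```python
import unicodedata

hangul = "서"  # first syllable of 서울
geez = "ማ"    # second character of አማርኛ

hangul_nfd = unicodedata.normalize("NFD", hangul)
geez_nfd = unicodedata.normalize("NFD", geez)

# The Hangul syllable decomposes into a leading consonant and a vowel jamo.
print([f"U+{ord(c):04X}" for c in hangul_nfd])
# The Ge'ez character has no decomposition, so NFD leaves it as-is.
print(geez_nfd == geez, len(geez_nfd))
```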
Also, compared to Hangul, it seems to me that Ge’ez could be handled by the dead key abstraction but gets an IME on Windows and Gnome in order to get visual feedback of the key that acts like the dead key, since those systems don’t provide visual feedback of dead keys. It seems to me that Amharic input could work as a keyboard layout on macOS.
Windows and Fedora come with a Yi-script IME for the Nuosu language. It appears that Ubuntu and macOS don’t.
Let’s try typing ꆈꌠ.
To write Yi syllables, you type in the standard romanization, which is unambiguous. You can look this up in the documentation of SIL Keyman. Gnome doesn’t require you to press space between syllables. Windows and Keyman do, so on Gnome, you type nuosu and on Windows nuo su (trailing space).
Han-script IMEs split into two main categories: ones that are based on the shape of each character and ones that map phonetic notation to Han characters using a dictionary. In the Japanese and Korean contexts, Han input is of the latter type. In the Chinese contexts, the most common input methods are as follows:
Character form | Shape-based | Phonetic |
---|---|---|
Traditional | Cangjie | Bopomofo / Zhuyin |
Simplified | Wubi 86 | Pinyin |
These four, plus Quick / Sucheng, which is a simplified version of Cangjie, are the ones that Windows, Mac, and major Linux distros (and Android Gboard) have in common. There are also others, both bundled with a specific OS and available as third-party products. The phonetic methods may be configurable to flip the traditional/simplified output expectation relative to the table above. Check your IME settings if the phonetic method tests below give you simplified form when expecting traditional or vice versa.
For testing these, let’s use the word for a Han character, 漢字 in traditional form and 汉字 in simplified form, which contains a character that is obviously different in traditional and simplified forms and a character that does not have a separate simplified form.
Cangjie (called ChangJie on Windows) assigns a unique key stroke sequence for each supported character. The sequence is based on assigning a radical (Cangjie-specific radical; not the same as KangXi radicals) to each key and decomposing the characters into radicals. On Windows, if you wish to input Hong Kong-specific characters or other characters that were not part of code page 950 (Big5 without the HKSCS extension), you need to check some boxes in the IME settings.
Let’s try typing 漢字.
You can use Wiktionary lookup for the individual characters to figure out the Cangjie keystrokes in terms of Cangjie key caps and QWERTY key caps (漢: 水廿中人 / etlo, 字: 十弓木 / jnd). macOS comes with a nice palette that you can also use to do these lookups (available from the input method selector menu when Cangjie is active). The space bar ends a character without producing a space in the output.
US QWERTY keys typed | Cangjie keys typed | Output |
---|---|---|
etlo jnd (trailing space) | 水廿中人 十弓木 (trailing space) | 漢字 |
The Quick (as it’s called on Windows, Linux, and Android) or Sucheng (as it’s called on macOS) method involves typing the first and last keystroke of the Cangjie sequence for the desired character and then choosing from a menu.
US QWERTY keys typed | Cangjie keys typed | Output |
---|---|---|
eo8jd2 | 水人8十木2 | 漢字 |
The digits refer to the position of the candidate character in the popup menu and depend on the implementation, on your personalized frecency, and potentially on context. In my case, the first character was the eighth in the popup and the second one was the second in the popup.
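The relationship between a Cangjie sequence and the corresponding Quick / Sucheng keystrokes is simple enough to sketch in Python. The function name quick_code is mine, and passing one- or two-key sequences through unchanged is my assumption rather than something I verified against an implementation.

```python
def quick_code(cangjie_code: str) -> str:
    """Return the first and last keystroke of a Cangjie sequence."""
    if len(cangjie_code) <= 2:
        # Assumption: sequences of one or two keys are typed as they are.
        return cangjie_code
    return cangjie_code[0] + cangjie_code[-1]


# The Cangjie sequences for 漢 and 字 from the Cangjie example above.
codes = [quick_code("etlo"), quick_code("jnd")]
print(codes)
```

The candidate digits (8 and 2 in my case) are not derivable this way, since they depend on the ordering of the popup.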
Bopomofo, also called Zhuyin, is a phonetic notation primarily for Mandarin whose characters are derived from Han characters. An IME of the same name (whether the name is Bopomofo or Zhuyin depends on the operating system) takes phonetic (in terms of Mandarin pronunciation) input as Bopomofo, which fits on a keyboard, and produces Han characters based on a dictionary lookup.
Let’s try typing 漢字 again.
This time, the Wiktionary lookup needs to be by the whole word. We find that the Bopomofo form is ㄏㄢˋ ㄗˋ. To figure out how these map to keys, let’s again look at a picture on Wikimedia Commons. A syllable ends with a tone or a space when there is no tone. Here ˋ is the tone, so we omit the space that Wiktionary includes after the first ˋ.
US QWERTY keys typed | Bopomofo keys typed | Output |
---|---|---|
c04y4 (return) | ㄏㄢˋㄗˋ (return) | 漢字 |
In this case, the dictionary lookup probably offers just one candidate, so we don’t need to use the down arrow key to choose a candidate and we can just commit the word using return.
Like Cangjie, Wubi 86 assigns a key stroke sequence to each supported character according to a radical-based decomposition, but Wubi limits the sequence for a character to at most four key strokes so that if the decomposition would result in more than four key strokes only the first three and the last one are used. Space ends composition for a given character.
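The four-key-stroke limit described above can be written down as a tiny Python function. This only illustrates the truncation rule; the function name and the six-key input are hypothetical, and the sketch does not model how the decomposition itself is derived or any other Wubi rules.

```python
def limit_wubi_sequence(decomposition_keys: str) -> str:
    """Keep at most four keys: the first three and the last one."""
    if len(decomposition_keys) <= 4:
        return decomposition_keys
    return decomposition_keys[:3] + decomposition_keys[-1]


# "abcdef" is a made-up decomposition, not the code of a real character.
limited = limit_wubi_sequence("abcdef")
print(limited)
```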
Let’s try typing 汉字.
There’s a Web site with Wubi lookup tables, which you can search using DuckDuckGo by entering the character and site:wubi.free.fr as the search terms. The key sequence is in the title tooltip. From there, we learn: 汉: ic, 字: pb.
US QWERTY keys typed | Output |
---|---|
ic pb (trailing space) | 汉字 |
Pinyin is a romanization system for Mandarin. In display form, it uses diacritics to indicate tone, but for IME use, you type without the diacritics.
Let’s try typing 汉字 again.
Wiktionary says that the pinyin form is hànzì.
US QWERTY keys typed | Output |
---|---|
hanzi1 | 汉字 |
In this case, the number key is not a tone but the position of the choices offered, which depends on what you’ve written previously (your personalized frecency). In this case, the IME offered what I wanted as the first choice, so I pressed the number key 1.
Despite Microsoft already shipping a Pinyin IME with Windows, it appears that publishing a Pinyin IME for Windows is a thing that search engine companies operating in China do. I gather these compete with Microsoft on dictionary coverage.
DaYi is conceptually similar to Wubi but for Traditional Chinese. It is bundled with Windows. Visually, it looks neglected by Microsoft in the transition to Windows 10 and looks like it is likely based on the same framework as the Chinese Array IME. I recall seeing a glitch (I forgot what exactly) that was specific to these two IMEs, which along with the neglected appearance made me suspect that these two IMEs might exercise the IME APIs in a different way from the other Chinese IMEs. (I didn’t verify this suspicion using logging or a debugger.) For this reason, it seems prudent to test this IME on Windows.
There’s a list of the input codes on GitHub.
US QWERTY keys typed | Output |
---|---|
xv mg (trailing space) | 漢字 |
Korean desktop IMEs provide a way of converting a word written in Hangul into Hanja (Han-script characters in Korean context) using a dictionary. However, since the usage is rare, unlike with phonetic Chinese IMEs or Japanese IMEs, you need to explicitly invoke the conversion. If you take a look at present-day Korean text, chances are that you’ll find only Hangul and if you occasionally find Han-script characters, they are either shorthand in newspaper headlines (e.g. 美 for the United States or 北 for North Korea) or restatements in parentheses. Google’s Gboard for Android doesn’t even have the Hanja conversion feature. On the other hand, the Korean IME on macOS can be configured to generate restatements in parentheses instead of just converting a word from Hangul to Hanja.
Invoking this feature varies by system. On Windows, the key that on a US QWERTY keyboard is the right ctrl key triggers Hangul to Hanja conversion, but on Gnome, you might have to press F9 instead, and on Mac option-return. On Windows and Mac, the lookup is on a per-word basis, so the cursor should be at the end of a word when invoking the conversion. On Gnome, the conversion is on a per-syllable basis.
Let’s try typing 漢字. Note that these are the same Unicode characters as in the Traditional Chinese case, but the second glyph looks different (the tiny stroke at the very top is vertical) in fonts meant for Korean or Japanese (below) compared to Chinese (Simplified or Traditional).
Wiktionary says the Hangul form is 한자, which, according to the previously-mentioned layout picture, is gkswk in terms of QWERTY keys.
On Gnome, you type ㅎㅏㄴ (gks), then press F9, then press arrow down until you find 漢, then press return, then type ㅈㅏ (wk), then press F9, and then press arrow down until you find 字, and then press return.
On Mac, you type ㅎㅏㄴㅈㅏ (gkswk), then press option-return, then press down arrow until you find 漢字, and then press return.
On Windows, you type ㅎㅏㄴㅈㅏ (gkswk), then press what on non-Korean keyboards is the ctrl key on the right side of the keyboard, then press down arrow until you find 漢字, and then press return.
An alternative way on Windows, which is worth testing, since it involves a unique UI gesture among IMEs (clicking some UI outside of the IME popup), is that, with the Windows legacy Language bar enabled, you type ㅎㅏㄴㅈㅏ (gkswk), then click the button labeled “漢” in the language bar, and then click 漢字 in the popup that shows up.
(Aside: Note that after typing ㅎㅏㄴㅈ but before typing the final ㅏ, the composition string shows 핝, a cluster that has two trailing consonants.)
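The intermediate cluster can be derived with the standard Unicode arithmetic for precomposed Hangul syllables, sketched here in Python. The index values follow the Unicode ordering of leading consonants, vowels, and trailing consonants; the helper name syllable is mine.

```python
import unicodedata

S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28


def syllable(lead_index, vowel_index, tail_index=0):
    """Compose a precomposed Hangul syllable from Unicode jamo indices."""
    return chr(S_BASE + (lead_index * V_COUNT + vowel_index) * T_COUNT + tail_index)


LEAD_H, VOWEL_A = 18, 0    # ㅎ and ㅏ
TAIL_N, TAIL_NJ = 4, 5     # ㄴ alone, and the two-consonant final ㄵ (ㄴ + ㅈ)

han = syllable(LEAD_H, VOWEL_A, TAIL_N)    # after typing gks
hanj = syllable(LEAD_H, VOWEL_A, TAIL_NJ)  # after typing gksw

print(han, hanj, unicodedata.name(hanj))
```

The two results are adjacent code points, which is why an IME can swap 한 for 핝 and back again as the composition string changes.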
Japanese IMEs convert Hiragana text to Kanji (Han-script characters in Japanese context). For example, if you’ve written the Japanese reading for 漢字 as かんじ (U.S. QWERTY key strokes tyd[), the IME offers to convert it into 漢字 by dictionary lookup as was previously seen in the Bopomofo, Pinyin, and Hangul-to-Hanja cases. Hiragana fits into a keyboard layout and you can configure Japanese IMEs such that each keystroke produces a Hiragana base character or a voicing mark directly with voicing marks immediately combining with their bases. Here じ is one Unicode scalar value produced by two key strokes: し, QWERTY d, and ゛, U.S. QWERTY left square bracket. Note that in legacy half-width Katakana the voicing marks remain as distinct characters in the text buffer. However, a Hiragana keyboard layout is not the default or, as I understand it, the popular configuration. (It is, though, the way Apple’s Ainu IME works out of the box.) Which brings us to the next category of IMEs.
(Updated with information about a Vietnamese Telex IME in Windows 10 1909 on 2020-02-14.)
There are cases where the keyboard layout abstraction is logically sufficient for a given script, but an IME is used nonetheless. The most notable ones are the step of getting from keystrokes to Hiragana in Japanese IMEs and writing Vietnamese using the Telex spelling. As I understand it, these are cases where the user community as a whole has a very strong preference towards IME rather than keyboard layout. Despite keyboard layouts being logically sufficient, IMEs exist for languages of India, but my understanding is that there isn’t a user community-wide IME preference.
Notably, as far as operating system vendors are concerned, there is a cut-off after Vietnamese. All vendors appear to agree that IMEs for Chinese (both traditional and simplified), Japanese, Korean, and Vietnamese are the must haves. However, it took a long time for Microsoft to bundle Vietnamese Telex IME. For languages used in India, what operating system vendors bundle does not appear to have a clear pattern or norm, and macOS lacks bundled Ge’ez and Yi IMEs. (There also appears to be a documentation cut-off such that it’s easy enough to find information in English about the CJK IMEs and the Vietnamese Telex IME and hard to find information in English about how to operate the rest.)
Despite Hiragana fitting into a keyboard layout in principle, the default configuration of Japanese IMEs is that in addition to the Hiragana to Kanji conversion layer (in the previous category), there is a Romaji to Hiragana conversion layer (in this category). Romaji means writing Japanese with Latin letters. As with Pinyin, the standards (of which there are multiple) use diacritical marks. IMEs, however, use ASCII-only notation. This gist was the best online reference I found, but the IME romanization is documented more fully in Ken Lunde’s book CJKV Information Processing (pages 304–306 in the second edition), and it appears to differ from the romanization standards in ways that are more than a matter of just omitting the diacritics from any one of the standards. On Ubuntu and Windows, the conversion table is viewable and editable in the settings of the Japanese IME.
Let’s try typing 漢字.
Wiktionary tells us that the romaji form is kanji.
US QWERTY keys typed | Output |
---|---|
k | k |
ka | か |
kan | かn |
kanj | かんj |
kanji | かんじ |
kanji (trailing space or return, depending on system, potentially after down arrows) | 漢字 |
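The Romaji to Hiragana layer shown in the first five rows of the table can be approximated with a short Python sketch. The conversion table here is a tiny hypothetical subset that I wrote for this example, not the table of any actual IME, and the rule for turning a pending n into ん is simplified.

```python
# A tiny, hypothetical subset of a romaji-to-hiragana conversion table.
ROMAJI = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "ji": "じ",
}


def romaji_to_kana(keys: str) -> str:
    """Convert typed ASCII keys to Hiragana, keeping unfinished input as-is."""
    out = []
    pending = ""
    for ch in keys:
        # Simplified rule: a pending "n" followed by a consonant other
        # than n or y becomes the standalone ん.
        if pending == "n" and ch not in "aiueony'":
            out.append("ん")
            pending = ""
        pending += ch
        if pending in ROMAJI:
            out.append(ROMAJI[pending])
            pending = ""
    return "".join(out) + pending


steps = [romaji_to_kana("kanji"[:i]) for i in range(1, 6)]
print(steps)
```

The final row of the table, where かんじ becomes 漢字, is the separate dictionary-based conversion layer and is not modeled here.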
Although Japanese IMEs have (on desktop) one dominant design, there are multiple implementations. Notably, despite using the same IME API (IBus), Ubuntu and Fedora ship different implementations, so it’s probably prudent to test both. Google ships a proprietary but gratis IME for Japanese for Windows. (As I understand it, the code is Open Source and shipped also on Ubuntu, but the dictionary is proprietary, and the dictionary is what distinguishes the IME from the one that Microsoft bundles with Windows.)
There’s a proprietary product called ATOK that appears to be popular. (As I understand it, the distinguishing feature of ATOK is that you can have cloud sync for your personal dictionary and word frecency across multiple computers and operating systems.) The ATOK code base is very old and it shows in technically interesting ways. In particular, it uses some non-Unicode Windows APIs and it has a character palette that isn’t a palette for Windows window management purposes, which means that the character palette takes the place of the application window as the Windows active window, and the character palette ends up sending text input to an application window that isn’t active for window management purposes, even though conceptually the active window is the window that receives text input. This can lead to interesting effects.
Japanese IMEs also support after-the-fact conversion actions similar to Hanja conversion in the Korean case. (Thanks to Masayuki Nakano for pointing these out to me.) Since these actions involve the IME querying the app for text that’s already there, these actions can expose bugs that normal composition operations don’t.
Immediately after committing a word, you can uncommit it by pressing ctrl-backspace (cmd-backspace on Mac).
More generally, you can request reconversion for the word around the text insertion point (i.e. the IME figures out what constitutes a word around the text insertion point) or for an explicit text selection. On Mac this appears to be triggered by both ctrl-shift-r and option-shift-r. I don’t know if there’s a subtle difference between the two. On Windows and Linux, this operation is by default bound to a key that doesn't exist on non-Japanese keyboards, so to test with a non-Japanese keyboard you need to go in the IME preferences, look for a reconversion action bound to the “Henkan” key and change the key binding to something else.
When Vietnamese tones are treated as separate key strokes, Vietnamese fits into a keyboard layout, and Vietnam indeed has a keyboard layout standard like this. The Vietnamese keyboard layout is unusual, however, in the sense that the stream of text that it produces is unnormalized in Unicode terms. Unlike in French, where accented characters for which there is no dedicated key are produced using dead keys by typing the accent first, Vietnamese tones are typed after the base character and this produces Unicode combining mark scalar values without IME post-processing in contrast to Hiragana voicing marks. However, some of the Vietnamese base characters are precomposed characters for Unicode purposes. Hence, the text stream produced by the standard keyboard layout is neither in Normalization Form C (because the tones are decomposed) nor in Normalization Form D (because the bases are precomposed).
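The normalization claim can be checked with Python's unicodedata module. The sketch assumes that the layout emits the precomposed ê followed by U+0301 COMBINING ACUTE ACCENT for the tone; I did not capture the exact output of a specific system for this example.

```python
import unicodedata

# Assumed keyboard layout output: precomposed base, then a combining tone mark.
typed = "\u00ea\u0301"  # ê followed by COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", typed)  # fully composed: ế (U+1EBF)
nfd = unicodedata.normalize("NFD", typed)  # fully decomposed: e, U+0302, U+0301

# The typed sequence matches neither normalization form.
print(typed == nfc, typed == nfd)
print([f"U+{ord(c):04X}" for c in nfc], [f"U+{ord(c):04X}" for c in nfd])
```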
An alternative, which I understand to be much more popular than the standard keyboard layout, is typing Vietnamese using Telex spelling and having an IME convert the ASCII-only Telex spelling to the official spelling (in Unicode Normalization Form C). This is analogous to writing German such that the user types the two letters oe and an IME converts them into ö.
(Updated with information about a Vietnamese Telex IME in Windows 10 1909 on 2020-02-14.) Although Windows 10 1909 ships a Telex IME for Vietnamese and defaults to it, for a long time Microsoft did not ship a Telex IME for Vietnamese and only shipped the standard keyboard layout. For earlier versions of Windows, you need to download and install a third-party Telex IME called UniKey. (Be sure not to download it with malware added from sites other than the linked official site.) Unfortunately, UniKey does not integrate with the Windows system input method switcher and instead has its own on/off UI.
In contrast, Google’s Gboard for Android only has the Telex IME for Vietnamese and doesn’t ship the standard keyboard layout. Ubuntu, Fedora, and macOS ship both.
Let’s try typing Tiếng Việt (the name of the Vietnamese language in Vietnamese).
The rules are on Wikipedia.
US QWERTY keys typed | Output |
---|---|
Tieengs Vieetj | Tiếng Việt |
As with modern Hangul, the rules are unambiguous, so there is no UI for explicitly guiding the transformation.
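To show why no candidate UI is needed, here is a rough Python sketch of a Telex-style transformation. It is heavily simplified and makes assumptions of my own: the tone key is only recognized as the last key of a word, upper-case vowels are not handled, and the tone is placed on the first vowel that already carries a circumflex, breve, or horn (falling back to the first plain vowel), which is not the full tone placement rule. The letter pairs and tone keys follow the Telex rules as described on Wikipedia.

```python
import unicodedata

LETTERS = {"aa": "â", "aw": "ă", "ee": "ê", "oo": "ô", "ow": "ơ", "uw": "ư", "dd": "đ"}
TONES = {"s": "\u0301", "f": "\u0300", "r": "\u0309", "x": "\u0303", "j": "\u0323"}
MARKED_VOWELS = "âăêôơư"
PLAIN_VOWELS = "aeiouy"


def telex_word(word: str) -> str:
    """Convert one Telex-spelled word to NFC Vietnamese (simplified)."""
    tone_key = word[-1] if len(word) > 1 and word[-1] in TONES else ""
    body = word[:-1] if tone_key else word
    for keys, letter in LETTERS.items():
        body = body.replace(keys, letter)
    if tone_key:
        # Simplified tone placement; see the assumptions stated above.
        targets = ([i for i, c in enumerate(body) if c in MARKED_VOWELS]
                   or [i for i, c in enumerate(body) if c in PLAIN_VOWELS])
        if not targets:
            return body + tone_key  # no vowel to carry a tone: keep the key
        i = targets[0]
        body = body[:i + 1] + TONES[tone_key] + body[i + 1:]
    # Compose base letters and tone marks into precomposed characters.
    return unicodedata.normalize("NFC", body)


def telex(text: str) -> str:
    return " ".join(telex_word(w) for w in text.split(" "))


result = telex("Tieengs Vieetj")
print(result)
```

Because each key sequence maps to exactly one result, the transformation is a pure function of the keystrokes, unlike the dictionary-backed Pinyin case.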
(Updated with information about Windows 1909 on 2020-02-14.)
As noted, languages used in India each fit into a keyboard layout (InScript is a family of federal government keyboard layouts, but Tamil Nadu also has a different state government standard for Tamil: Tamil 99) and there appears to be no established cross-platform practice of which languages also get an IME as an alternative. The way the IMEs work is that you type in some phonetic romanization. With the Windows 8-era IMEs and on Gnome, there appear to be strict rules as with Vietnamese Telex input, so that there’s no need for additional UI beyond the keystrokes themselves. I failed to figure out what those rules are, though! On Mac, Windows 10 1909, and Android Gboard, the behavior is closer to the way a Pinyin IME works: The input is approximate and you choose from options. It also seems to me that for a given system, the reverse Latin transliteration IMEs for the languages of India are all built on the same framework, so it seems that it’s sufficient to test one per system. I tested Bengali on Windows, Hindi on macOS, and Tamil on Fedora.
For earlier versions of Windows 10, the IMEs don’t install the same way as CJK IMEs. Rather, you need to download and install them separately. The downloads for Windows 8 in the Indic Input 3 section on Microsoft’s site work on Windows 10. However, you need to have the legacy Language bar enabled to switch to the IMEs. To test Bengali, I added the language Bangla (India) to Windows the usual way and then installed Bengali Indic Input 3. Subsequently, I was able to switch to the IME using the Language bar. Windows 10 1909 ships new IMEs that install the same way as CJK IMEs (but the InScript-family layouts are still the defaults).
For Tamil and Bengali, I failed to figure out how to type the name of the language itself, but it turns out that the word for “father” in both languages was easy enough to figure out how to type.
Language | System | Input | Output |
---|---|---|---|
Bengali | Windows 10 1803 | pitaa | পিতা |
Tamil | Fedora | appaa | அப்பா |
Hindi | macOS | hindee, then right arrow, right arrow, return | हिन्दी |
IMEs have the notion of a composition string. This is the part of the text that is rendered by the application but that still isn’t “done” and that the IME still expects subsequent keystrokes to change. Typically, the composition string is underlined, though for Hangul the convention appears to be highlighting the whole glyph of the current syllable. At some point, the composition gets committed. The primary commit action depends on the IME: a space keystroke, a return keystroke, a number key keystroke, or typing the vowel of the next Hangul syllable. In addition to that, the composition may need to be committed in response to the user clicking something that blurs the text field that had an uncommitted composition in it. Obviously, this is an opportunity for bugs.
Therefore, it’s worthwhile to test that the composition gets committed if the next action is something that makes use of the string, such as pressing the submit button on a form. With IMEs that have a popup, you should try this with the popup open. With Korean and Vietnamese IMEs, you should test syllables that could still gain a trailing consonant in Korean or a tone in Vietnamese.
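To make the expected behavior concrete, here is a toy Python model of a text field that hosts a composition. It is a sketch under stated assumptions, not any platform’s IME API: the class `TextField`, its methods, and the helper `hangul_syllable` are all hypothetical names. The only non-hypothetical part is the arithmetic for precomposed Hangul syllables, which comes from the Unicode Standard.

```python
class TextField:
    """Toy model of a text field with an IME composition (hypothetical)."""

    def __init__(self) -> None:
        self.committed = ""    # text the IME has finished with
        self.composition = ""  # text the IME may still change

    def set_composition(self, text: str) -> None:
        # In this model the IME replaces the whole composition on each keystroke.
        self.composition = text

    def commit(self) -> None:
        # Committing appends the composition to the settled text and ends it.
        self.committed += self.composition
        self.composition = ""

    def blur(self) -> None:
        # Losing focus must not drop an unfinished composition.
        self.commit()

    def submit(self) -> str:
        # Clicking a submit button blurs the field, then the value is read.
        self.blur()
        return self.committed


def hangul_syllable(lead: int, vowel: int, tail: int = 0) -> str:
    # Unicode arithmetic for precomposed Hangul: 21 vowels, 28 tail slots.
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)


# A syllable that could still gain a trailing consonant: lead 18, vowel 0.
field = TextField()
field.set_composition(hangul_syllable(18, 0))
submitted = field.submit()  # the open syllable must survive the submit
```

The property to check in a real application is the same as in the model: after the submit, the uncommitted syllable is part of the submitted string and no composition is left open.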
Also, you should try opening menus from the menubar both by mouse and by keyboard with an unfinished composition open.
If the application supports arbitrary geometric transformations (scale, rotate, translate) of content, as a Web browser does, it’s a good idea to test that IMEs that show a popup end up showing the popup in the right place when the text insertion point is within transformed content.
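When judging popup placement by eye, it can help to compute where the insertion point should land after the transformation. The following Python sketch is only an aid for that and assumes a single transform that scales, then rotates about the origin, then translates; real content may nest several transforms with other origins. The function name `transform_point` and its parameters are made up for illustration.

```python
import math


def transform_point(x, y, scale=1.0, degrees=0.0, dx=0.0, dy=0.0):
    """Apply scale, then rotation about the origin, then translation."""
    x, y = x * scale, y * scale
    angle = math.radians(degrees)
    rotated_x = x * math.cos(angle) - y * math.sin(angle)
    rotated_y = x * math.sin(angle) + y * math.cos(angle)
    return rotated_x + dx, rotated_y + dy


# Caret at (10, 0) in untransformed content coordinates.
caret_x, caret_y = transform_point(10, 0, scale=2, degrees=90, dx=5, dy=5)
```

With the y axis pointing down, as is usual for screen coordinates, a positive angle in this sketch appears as a clockwise rotation. The popup should show up near the computed point rather than near the untransformed one.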
Is it really worthwhile to test all of these IMEs? Which ones exercise the IME APIs so similarly that they are mutually equivalent for purposes of testing API interaction?
I don’t know, and before trying them all, I couldn’t have guessed what I ended up seeing. In particular, the needs of Korean IMEs appear simpler than the needs of Japanese and Chinese IMEs, so I expected Korean IMEs to expose fewer bugs. Yet, I saw bugs that initially appeared Korean-specific (a couple in Firefox and one, on Ubuntu, in the IME). Part of this is due to Hanja conversion being an after-the-fact operation rather than part of writing the word initially. Later I learned that it was possible to reproduce a bug that looked Hanja conversion-specific by requesting undo or reconversion from Japanese IMEs. In retrospect, it seems to me that testing OS-bundled Japanese on each OS (including the undo and reconversion features), DaYi on Windows, and peripheral palette features of ATOK on Windows would have given enough coverage for my needs in 2019, but chances are your set of bugs is revealed by a different minimal set of IMEs. Testing the ATOK character palette is not something that one thinks of when superficially trying every IME: the ATOK character palette issue was reported by a user. Also, in the case of reconversion bugs, I was able to think of the required actions on my own in the Korean case but needed advice on what to try in the Japanese case.
Even though the list looks long, iterating through each of the system-bundled IMEs on Windows 10, macOS, and Ubuntu plus checking the IMEs on Fedora that differ from those provided by Ubuntu doesn’t really take that much time with the hints of what to try listed above, so I encourage you to go through the whole list. You could sink a lot of time into installing third-party IMEs though. Still, it’s probably a good idea to test at least UniKey for Vietnamese on Windows.