It’s Not Wrong that "🤦🏼‍♂️".length == 7

But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5

From time to time, someone shows that in JavaScript the .length of a string containing an emoji results in a number greater than 1 (typically 2) and then proceeds to the conclusion that haha JavaScript is so broken—and is rewarded with many likes. In this post, I will try to convince you that ridiculing JavaScript for this is less insightful than it first appears and that Swift’s approach to string length isn’t unambiguously the best one. Python 3’s approach is unambiguously the worst one, though.

What’s Going on with the Title?

"🤦🏼‍♂️".length == 7 evaluates to true as JavaScript. Let’s try JavaScript console in Firefox:

"🤦🏼‍♂️".length == 7
true

Haha, right? Well, you’ve been told that the Python community suffered the Python 2 vs. Python 3 split, among other things, to Get Unicode Right. Let’s try Python 3:

$ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("🤦🏼‍♂️") == 5
True
>>> 

OK, then. Now, Rust has the benefit of learning from languages that came before it. Let’s try Rust:

$ cargo new -q length
$ cd length
$ echo 'fn main() { println!("{}", "🤦🏼‍♂️".len() == 17); }' > src/main.rs
$ cargo run -q
true

That’s better!

What?

The string contains a single emoji consisting of five Unicode scalar values:

Unicode scalar                             UTF-32 code units  UTF-16 code units  UTF-8 code units  UTF-32 bytes  UTF-16 bytes  UTF-8 bytes
U+1F926 FACE PALM                          1                  2                  4                 4             4             4
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3  1                  2                  4                 4             4             4
U+200D ZERO WIDTH JOINER                   1                  1                  3                 4             2             3
U+2642 MALE SIGN                           1                  1                  3                 4             2             3
U+FE0F VARIATION SELECTOR-16               1                  1                  3                 4             2             3
Total                                      5                  7                  17                20            14            17

The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skintone modifier that changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.

Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings store Unicode code points each of which is stored as one code unit by CPython 3, so the string occupies 5 code units. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. We’ll come back to the actual storage as opposed to semantics later.
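The table above can be reproduced mechanically. Here is a minimal Python sketch that spells the string out as explicit escapes (so the result does not depend on how an editor stores the emoji) and counts the code units of each scalar value by encoding it. The variable names are mine and purely illustrative.

```python
import unicodedata

# The facepalm emoji from the title, spelled out as its five scalar values.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

rows = []
for ch in s:
    utf32_units = 1                                 # one code unit per scalar value
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit
    utf8_units = len(ch.encode("utf-8"))            # one byte per UTF-8 code unit
    # Fall back to "?" if the local Unicode database lacks a name.
    name = unicodedata.name(ch, "?")
    rows.append((f"U+{ord(ch):04X}", name, utf32_units, utf16_units, utf8_units))

for row in rows:
    print(*row)

# Column totals: UTF-32, UTF-16, and UTF-8 code units.
totals = tuple(sum(row[i] for row in rows) for i in (2, 3, 4))
print("Total", *totals)  # Total 5 7 17
```

The three totals are the lengths that Python 3, JavaScript, and Rust reported above.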

Note about Python 3 added on 2019-09-09: Originally this article claimed that Python 3 guaranteed UTF-32 validity. This was in error. Python 3 guarantees that the units of the string stay within the Unicode code point range but does not guarantee the absence of surrogates. It not only allows unpaired surrogates, which might be explained by wishing to be compatible with the value space of potentially-invalid UTF-16, but Python 3 allows materializing even surrogate pairs, which is a truly bizarre design. The previous conclusions stand with the added conclusion that Python 3 is even more messed up than I thought! With the way the example string was constructed in Python 3, the Python 3 string happens to match the valid UTF-32 representation of the string, so it is still illustrative of UTF-32, but the rest of the article has been slightly edited to avoid claiming that Python 3 used UTF-32.
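The behavior described in the note is easy to observe. The following is a small sketch, assuming CPython 3; it shows a lone surrogate, a materialized surrogate pair, and the fact that the pair is not the same string as the scalar value it would denote in UTF-16.

```python
# A lone surrogate is accepted as a string element.
lone = "\ud83e"
lone_len = len(lone)  # 1

# So is a materialized surrogate pair: two elements, not one scalar value.
pair = "\ud83e\udd26"
pair_len = len(pair)  # 2

# Such strings are not valid Unicode text, so strict UTF-8 encoding refuses them.
try:
    pair.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

# Round-tripping through UTF-16 with the "surrogatepass" error handler
# combines the pair into the single scalar value U+1F926.
combined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print(pair_len, strict_utf8_ok, len(combined), pair == combined)
# prints "2 False 1 False"
```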

But I Want the Length to Be 1!

There’s a language for that. The following used Swift 4.2.3, which was the latest release when I was researching this, on Ubuntu 18.04:

$ mkdir swiftlen
$ cd swiftlen/
$ swift package init -q --type executable
$ swift package init --type executable
Creating executable package: swiftlen
Creating Package.swift
Creating README.md
Creating .gitignore
Creating Sources/
Creating Sources/swiftlen/main.swift
Creating Tests/
Creating Tests/LinuxMain.swift
Creating Tests/swiftlenTests/
Creating Tests/swiftlenTests/swiftlenTests.swift
Creating Tests/swiftlenTests/XCTestManifests.swift
$ echo 'print("🤦🏼‍♂️".count == 1)' > Sources/swiftlen/main.swift 
$ swift run swiftlen 2>/dev/null
true

(Not using the Swift REPL for the example, because it does not appear to accept non-ASCII input on Ubuntu! Swift 5.0.3 prints the same and the REPL is still broken.)

OK, so we’ve found a language that thinks the string contains one countable unit. But what is that countable unit? It’s an extended grapheme cluster. (“Extended” to distinguish from the older attempt at defining grapheme clusters now called legacy grapheme clusters.) The definition is in Unicode Standard Annex #29 (UAX #29).

The Lengths Seen So Far

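We’ve seen four different lengths so far:

Number of UTF-8 code units (17 in this case)
Number of UTF-16 code units (7 in this case)
Number of UTF-32 code units or Unicode scalar values (5 in this case)
Number of extended grapheme clusters (1 in this case)

Given a valid Unicode string and a version of Unicode, all of the above are well-defined and it holds that each item higher on the list is greater than or equal to the items lower on the list.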

One of these is not like the others, though: The first three numbers have an unchanging definition for any valid Unicode string whether it contains currently assigned scalar values or whether it is from the future and contains unassigned scalar values as far as software written today is aware. Also, computing the first three lengths does not involve lookups from the Unicode database. However, the last item depends on the Unicode version and involves lookups from the Unicode database. If a string contains scalar values that are unassigned as far as the copy of the Unicode database that the program is using is aware, the program will potentially overcount extended grapheme clusters in the string compared to a program whose copy of the Unicode database is newer and has assignments for those scalar values (and some of those assignments turn out to be combining characters).

More Than One Length per Programming Language

It is not the case that a given programming language has to choose only one of the above. If we run this Swift program:

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)

it prints:

1
5
7
17

Let’s try Rust with unicode-segmentation = "1.3.0" in Cargo.toml:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
	let s = "🤦🏼‍♂️";
	println!("{}", s.graphemes(true).count());
	println!("{}", s.chars().count());
	println!("{}", s.encode_utf16().count());
	println!("{}", s.len());
}

The above program prints:

2
5
7
17

That’s unexpected! It turns out that unicode-segmentation does not implement the latest version of the Unicode segmentation rules, so it gives the ZERO WIDTH JOINER generic treatment (break right after ZWJ) instead of the newer refinement in the emoji context.

Let’s try again, but this time with unic-segment = "0.9.0" in Cargo.toml:

use unic_segment::Graphemes;

fn main() {
	let s = "🤦🏼‍♂️";
	println!("{}", Graphemes::new(s).count());
	println!("{}", s.chars().count());
	println!("{}", s.encode_utf16().count());
	println!("{}", s.len());
}

The above program prints:

1
5
7
17

In the Rust case, strings (here mere string slices) know the number of UTF-8 code units they contain. The len() method call just returns this number that has been stored since the creation of the string (in this case, compile time). In the other cases, what happens is the creation of an iterator and then instead of actually examining the values (string slices corresponding to extended grapheme clusters, Unicode scalar values or UTF-16 code units) that the iterator would yield, the count() method just consumes the iterator and returns the number of items that were yielded by the iteration. The count isn’t stored anywhere on the string (slice) afterwards. If we wanted to later know the counts again, we’d have to iterate over the string again.
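To make the difference between a stored length and a computed one concrete, here is a rough Python sketch that works on a UTF-8 byte buffer. It is not how Rust implements chars() or encode_utf16(); it only shows that the other counts require a pass over the data, whereas the byte length is already known. The function names are made up for the illustration, and the input is assumed to be valid UTF-8.

```python
def count_scalar_values(utf8: bytes) -> int:
    # Every scalar value has exactly one lead byte. Continuation bytes
    # have the bit pattern 10xxxxxx and are skipped.
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)

def count_utf16_units(utf8: bytes) -> int:
    # Each scalar value needs one UTF-16 code unit, except those encoded as
    # four UTF-8 bytes (lead byte >= 0xF0), which need a surrogate pair.
    return sum((2 if b >= 0xF0 else 1) for b in utf8 if (b & 0xC0) != 0x80)

utf8 = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode("utf-8")

# len() reads a stored number; the other two walk all 17 bytes.
print(len(utf8), count_utf16_units(utf8), count_scalar_values(utf8))
# prints "17 7 5"
```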

Know in Advance or Compute When Needed?

This introduces a notable question in the design space: Should a given type of length quantity be eagerly computed when the string is created? Or should the length be computed when someone asks for it? Or should it be computed when someone asks for it and then automatically stored on the string object so that it’s available immediately if someone asks for it again?

The answer Rust has is that the length in the code units of the Unicode Encoding Form of the language is stored upon string creation, and the rest are computed when someone asks for them (and then forgotten and not stored on the string).

Swift is a higher-level language and doesn’t document the exact nature of its string internals as part of the API contract. In fact, the internal representation of Swift strings changed substantially between Swift 4.2 and Swift 5.0. It’s not documented if different views to the string are held onto once created, for example. The documentation does say that strings are copy-on-write, so the first mutation may involve copying the string’s storage.

Notably, the design space includes not remembering anything. The C programming language is a prominent example of this case. C strings don’t even remember their number of code units. To find out the number of code units, you have to iterate over the string until a sentinel value. In the case of C, the sentinel is the code unit for U+0000, so it excludes one Unicode scalar value from the possible string contents. However, that’s not a strictly necessary property of a sentinel-based design that doesn’t remember any lengths. 0xFF does not occur as a code unit in any valid UTF-8 string and 0xFFFFFFFF does not occur in any valid UTF-32 string, so they could be used as sentinels for UTF-8 and UTF-32 storage, respectively, without excluding a scalar value from the Unicode value space. There is no 16-bit value that never occurs in a valid UTF-16 string. However, a valid UTF-16 string does not contain unpaired surrogates, so an unpaired low surrogate could, in principle, be used as a sentinel in a design that wanted to use guaranteed-valid UTF-16 strings that don’t remember their code unit length.
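As an illustration of the sentinel idea, the sketch below emulates a C-style length scan in Python and compares a zero sentinel with a hypothetical 0xFF sentinel. No real string library is being described here; the helper name is invented.

```python
def sentinel_length(buf: bytes, sentinel: int) -> int:
    # O(n) scan, like C's strlen(): the length is not stored anywhere.
    n = 0
    while buf[n] != sentinel:
        n += 1
    return n

text = "a\u0000b".encode("utf-8")  # contains U+0000 as real content

c_style = text + b"\x00"   # zero-terminated, as in C
ff_style = text + b"\xff"  # hypothetical 0xFF-terminated variant

c_len = sentinel_length(c_style, 0x00)
ff_len = sentinel_length(ff_style, 0xFF)
print(c_len, ff_len)  # prints "1 3"
```

With the zero sentinel, the embedded U+0000 cuts the string short. With 0xFF, which never occurs in valid UTF-8, all three code units survive.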

Knowing the Storage-Native Code Unit Length is Extremely Reasonable

The length of the string as counted in code units of its storage-native Unicode Encoding Form (i.e. whichever of UTF-8, UTF-16, and UTF-32 the programming language has chosen for its string semantics) is not like the other lengths. It is the length that the implementation cannot avoid having to know at the time of creating a new string, because it is the length that is required to be known in order to be able to allocate storage for a string. Even C, which promptly forgets about the code unit length in the storage-native Unicode Encoding Form after the string has been created, has to know this length when allocating storage for a new string.

That is, the design decision is about whether to remember this length. It is not about whether to compute it eagerly. You just have to have it at string creation time—i.e. eagerly.

Considering that remembering this quantity makes string concatenation, which is a common operation, substantially faster to implement compared to not remembering this quantity, remembering this quantity is fundamentally reasonable. Also, it means that you don’t need to maintain a sentinel value, which means that a substring operation can yield results that share the buffer with the original string instead of having to copy in order to be able to insert a sentinel. (Note that you can easily foil this benefit if you wish to eagerly maintain zero-termination for the sake of C string compatibility.)
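Both benefits can be sketched in Python using bytes and memoryview as stand-ins for a length-remembering string type. This is an analogy under that assumption, not a description of any particular string implementation.

```python
buf = "h\u00e9llo w\u00f6rld".encode("utf-8")  # 13 UTF-8 code units
view = memoryview(buf)

# A substring is just (buffer, offset, length): no bytes are copied and
# no sentinel has to be written after the last code unit.
word = view[7:13]
shares_buffer = word.obj is buf

# Concatenation can size the destination up front from the stored lengths
# instead of scanning both inputs for a terminator first.
prefix = "caf\u00e9 ".encode("utf-8")
out = bytearray(len(prefix) + len(word))
out[:len(prefix)] = prefix
out[len(prefix):] = word
result = out.decode("utf-8")

print(shares_buffer, len(out), result)
```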

What About Knowing the Other Lengths?

Even if we’ve established that it makes sense for a string implementation to remember the storage length of the string in code units of the storage-native Unicode encoding form, it doesn’t answer whether a string implementation should also remember other lengths or which kind of length should be offered in the most ergonomic API. (As we see above, Swift makes the number of extended grapheme clusters more ergonomic to obtain than the code unit or scalar value length.)

Also, if any other length is to be remembered, there is the question of whether it should be eagerly computed at string creation time or lazily computed the first time someone asks for it. It is easy to see why at least the latter does not make sense for a multi-threaded systems-programming language like Rust. If some properties of an object are lazily initialized, in a multi-threaded case you also need to solve synchronization of these computations. Furthermore, you need to allocate space at least for a pointer to auxiliary information if you want to be able to add auxiliary information later or you need to have a hashtable of auxiliary information where the string the information is about is the key, so auxiliary information, even when not present, has storage implications or implications of having to have global state in a run-time system. Finally, for systems programming, it may be more desirable to know the time complexity of a given operation clearly even if it means “always O(n)” instead of “possibly O(n) but sometimes O(1)”. Even if the latter looks strictly better, it is less predictable.
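The costs listed above become visible as soon as one tries to write the lazy variant down. The class below is a hypothetical wrapper, invented for this illustration, that caches the scalar value count on first use.

```python
import threading

class CachedCountString:
    """Hypothetical string wrapper that lazily caches its scalar value count."""

    def __init__(self, text: str):
        self.utf8 = text.encode("utf-8")  # storage length is known eagerly
        self._scalar_count = None         # extra slot paid for by every string
        self._lock = threading.Lock()     # extra synchronization state

    def scalar_count(self) -> int:
        # O(n) on the first call, O(1) afterwards: cheaper on average,
        # but the cost of any single call is harder to predict.
        with self._lock:
            if self._scalar_count is None:
                self._scalar_count = sum(
                    1 for b in self.utf8 if (b & 0xC0) != 0x80
                )
            return self._scalar_count

s = CachedCountString("\U0001F926\U0001F3FC\u200D\u2642\uFE0F")
print(len(s.utf8), s.scalar_count())  # prints "17 5"
```

Every instance carries the cache slot and the lock whether or not anyone ever asks for the count.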

For a higher-level language, arguments from space requirements or synchronization issues might not be decisive. It’s more relevant to consider what a given length quantity is used for. This is often forgotten in Internet debates that revolve around what length is the most “correct” or “logical” one. So for the lengths that don’t map to the size of storage allocation, what are they good for?

It turns out that in the Firefox code base there are two places where someone wants to know the number of Unicode scalar values in a string that is not being stored as UTF-32 and attention is not paid to what the scalar values actually are. The IETF specification for Session Traversal Utilities for NAT (STUN) used for WebRTC has the curious property that it places length limits on certain protocol strings such that the limits are expressed as number of Unicode scalar values but the strings are transmitted in UTF-8. Firefox validates these limits. (The limit looks like an arbitrary power-of-two (128 scalar values). The spec has remarks about the possible resulting byte length, which was wrong according to the IETF UTF-8 RFC that was current and already nearly five years old at the time of publication of the STUN RFC. Specifically, the STUN RFC repeatedly says that 128 characters as UTF-8 may be as long as 763 bytes. To arrive at that number, you have to assume that a UTF-8 character can be up to six bytes long, as opposed to up to 4 bytes long as in the prevailing UTF-8 RFC and in the Unicode Standard, and that the last character of the 128 is a zero terminator and, therefore, known to take just one byte.) In this case, the reason for wishing to know a non-storage length is to impose a limit. The other case is reporting the column number for the source location of JavaScript errors.
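The arithmetic behind the 763-byte figure, and the kind of check such a limit calls for, can be sketched as follows. The function is a hypothetical illustration and is not Firefox’s code; it assumes the input has already been validated as UTF-8.

```python
MAX_SCALARS = 128  # the limit is expressed in scalar values, not bytes

def within_limit(utf8: bytes, max_scalars: int = MAX_SCALARS) -> bool:
    # Count lead bytes, i.e. everything except 10xxxxxx continuation bytes.
    scalars = sum(1 for b in utf8 if (b & 0xC0) != 0x80)
    return scalars <= max_scalars

# The 763-byte figure: 127 characters of up to 6 bytes plus a 1-byte terminator.
obsolete_bound = 127 * 6 + 1
# With UTF-8 capped at 4 bytes per scalar value, 128 scalar values need at most:
actual_bound = 128 * 4

print(obsolete_bound, actual_bound)  # prints "763 512"

longest_ok = ("\U0001F926" * 128).encode("utf-8")  # 128 scalar values, 512 bytes
too_long = ("a" * 129).encode("utf-8")             # 129 scalar values, 129 bytes
print(within_limit(longest_ok), within_limit(too_long))  # prints "True False"
```

Note that the string that passes is almost four times as large in bytes as the one that fails, which is what makes this kind of limit curious for a protocol that transmits UTF-8.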

Length limits, which we’ll come back to, probably aren’t a frequent enough use case to justify making strings know a particular kind of length as opposed to such length being possible to compute when asked for. Neither are error messages.

Another use case for asking for a length is iterating by index and using the length as the loop termination condition 1990s Java style. Like this:

for (int i = 0; i < s.length(); i++) {
    // Do something with s.charAt(i)
}

In this case, it’s actually important for the length to be a precomputed number on the string object. This use case is coupled with the requirement that indexing into the string to find the nth unit corresponding to the count of units that the “length” represents should be a fast operation.

The above pattern is a lot less conclusive in terms of what lengths should be precomputed (and what the indexing unit should be) than it first appears. The above loop doesn’t do random access by index. It sequentially uses every index from zero up to, but not including, length. Indeed, especially when iterating over a string by Unicode scalar value, typically when you examine the contents of a string, you iterate over the string in order. Programming languages these days provide an iterator facility for this, and e.g. to iterate over a UTF-8 string by scalar value, the iterator does not need to know the number of scalar values up front. E.g. in Rust, you can do this in O(n) time despite string slices not knowing their number of Unicode scalar values:

for c in s.chars() {
    // Do something with c
}

(Note that char is an 8-bit code unit (possibly UTF-8 code unit) in C and C++, char is a UTF-16 code unit in Java, char is a Unicode scalar value in Rust, and Character is an extended grapheme cluster in Swift.)

A programming language together with its library ecosystem should provide iteration over a string by Unicode scalar value and by extended grapheme cluster, but it does not follow that strings would need to know the scalar value length or the extended grapheme cluster length up front. Unlike the code unit storage length, those quantities aren’t useful for accelerating operations like concatenation that don’t care about the exact content of the string.
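To show that sequential iteration by scalar value needs no up-front count, here is a simplified UTF-8 decoding generator in Python. It is a sketch that assumes valid UTF-8 input and performs no validation; real iterators, such as the one behind Rust’s chars(), are implemented differently.

```python
def scalar_values(utf8: bytes):
    # Walk the buffer once; the lead byte tells how many bytes follow.
    i = 0
    while i < len(utf8):
        lead = utf8[i]
        if lead < 0x80:
            width, cp = 1, lead
        elif lead < 0xE0:
            width, cp = 2, lead & 0x1F
        elif lead < 0xF0:
            width, cp = 3, lead & 0x0F
        else:
            width, cp = 4, lead & 0x07
        # Each continuation byte contributes its low six bits.
        for b in utf8[i + 1:i + width]:
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += width

utf8 = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode("utf-8")
decoded = [f"U+{cp:04X}" for cp in scalar_values(utf8)]
print(decoded)
# ['U+1F926', 'U+1F3FC', 'U+200D', 'U+2642', 'U+FE0F']
```

The loop only ever compares a byte index against the stored byte length. The number of scalar values falls out at the end, if anyone still wants it.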

Which Unicode Encoding Form Should a Programming Language Choose?

The observation that having strings know their code unit length in their storage-native Unicode encoding form is extremely reasonable does not answer how many bits wide the code units should be.

The usual way to approach this question is to argue that UTF-32 is the best, because it provides O(1) indexing by “character” in the sense of a character meaning a Unicode scalar value, or the argument focuses on whether UTF-8 is unfair to some languages relative to UTF-16. I think these are bad ways to approach this question.

First of all, the argument that the answer should be UTF-32 is bad on two counts. First, it assumes that random access by scalar value is important, but in practice it isn’t. It’s reasonable to want to have a capability to iterate over a string by scalar value, but random access by scalar value is in the YAGNI department. Second, arguments in favor of UTF-32 typically come at a point where the person making the argument has learned about surrogate pairs in UTF-16 but has not yet learned about extended grapheme clusters being even larger things that the user perceives as a unit. That is, if you escape the variable-width nature of UTF-16 to UTF-32, you pay by doubling the memory requirements and extended grapheme clusters are still variable-width.

I’ll come back to the length fairness issue later, but I think a different argument is much more relevant in practice for the choice of in-memory Unicode encoding form. The more relevant argument is this: Implementations that choose UTF-8 actually accept the UTF-8 storage requirements. When wider-unit semantics are chosen for a language that doesn’t provide raw memory access and, therefore, has the opportunity to tweak string storage, the implementations try to come up with ways to avoid actually paying the cost of the wider units in some situations.

JavaScript and Java strings have the semantics of potentially-invalid UTF-16. SpiderMonkey and V8 implement an optimization for omitting the leading zeros of each code unit in a string, i.e. storing the string as ISO-8859-1 (the actual ISO-8859-1, not the Web notion of “ISO-8859-1” as a label of windows-1252), when all code units in the string have zeros in the most-significant half. The HotSpot JVM also implements this optimization, though enabling it is optional. Swift 4.2 implements a slightly different variant of the same idea, where ASCII-only strings are stored as 8-bit units and everything else is stored as UTF-16. CPython since 3.3 makes the same idea three-level with code point semantics: Strings are stored with 32-bit code units if at least one code point has a non-zero bit above the low 16 bits. Else if a string has non-zero bits above the low 8 bits for at least one code point, the string is stored as 16-bit units. Otherwise, the string is stored as 8-bit units (Latin1).
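The three-level CPython storage can be observed from the outside. This is a small sketch that relies on sys.getsizeof, so the numbers are an implementation detail of CPython 3.3 and later and may differ on other Python implementations.

```python
import sys

def bytes_per_element(ch: str) -> int:
    # Difference in object size when the string grows by one element;
    # the fixed header overhead cancels out.
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

widths = (
    bytes_per_element("a"),           # all code points fit in 8 bits
    bytes_per_element("\u0142"),      # needs more than 8 bits
    bytes_per_element("\U0001F926"),  # needs more than 16 bits
)
print(widths)  # (1, 2, 4) on CPython 3.3 and later
```

A single facepalm in an otherwise ASCII string is enough to move the whole string to 32-bit units.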

I think the unwillingness of implementations of languages that have chosen UTF-16 or UTF-32 (or UTF-32-ish as in the case of Python 3) string semantics to actually use UTF-16 or UTF-32 storage when they can get away with not using actual UTF-16 or UTF-32 storage is the clearest indictment against UTF-16 or UTF-32 (and other wide-unit semantics like what Python 3 uses).

Languages that choose UTF-8, on the other hand, stick to actual UTF-8 for the purpose of storing Unicode scalar values. When languages that choose UTF-8 deviate from UTF-8, they do so in order to represent values that are not Unicode scalar values for compatibility with external constraints. Rust uses a representation called WTF-8 for file system paths on Windows. All UTF-8 strings are WTF-8 strings, but WTF-8 can also represent unpaired surrogates for compatibility with Windows file paths being sequences of 16-bit units that can contain unpaired surrogates. Perl 6 uses an internal representation called UTF-8 Clean-8 (or UTF8-C8), which represents strings that consist of Unicode scalar values in Unicode Normalization Form C the same way as UTF-8 but represents non-NFC content differently and can represent sequences of bytes that are not valid UTF-8.

UTF-8 is the only one of the Unicode encoding forms that is also a Unicode encoding scheme, and of the Unicode encoding schemes, UTF-8 has clearly won for interchange. (Unicode encoding forms are what you have in RAM, so UTF-16 consists of native-endian, two-byte-aligned 16-bit code units. Unicode encoding schemes are what can be used for byte-oriented interchange, so e.g. UTF-16LE consists of 8-bit code units every pair of which form a potentially-unaligned little-endian 16-bit number, which in turn may form a surrogate pair.) When UTF-8 is used as the in-RAM representation, input and output operations are less expensive than with UTF-16 or UTF-32. UTF-16 or UTF-32 in RAM requires conversion from UTF-8 when reading input and conversion to UTF-8 when writing output. A system that guarantees UTF-8 validity internally, such as Rust, needs only to validate UTF-8 upon reading input and no conversion is needed when writing output. (Go takes a garbage in, garbage out approach to UTF-8: input is not validated at input time and output is written without conversion. However, iteration by scalar value can yield REPLACEMENT CHARACTERs when iterating over invalid UTF-8. That is, the input step is less expensive than in Rust, but iterating by scalar value is marginally more expensive. The output step is less correct.)
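The form versus scheme distinction is easy to see in bytes. The sketch below serializes the base character of the example emoji with Python’s standard codecs.

```python
ch = "\U0001F926"  # FACE PALM, a surrogate pair (D83E DD26) in UTF-16

utf16le = ch.encode("utf-16-le").hex()
utf16be = ch.encode("utf-16-be").hex()
utf8 = ch.encode("utf-8").hex()

print(utf16le)  # 3ed826dd
print(utf16be)  # d83edd26
print(utf8)     # f09fa4a6
```

The same two 16-bit code units yield two different byte sequences depending on the byte order, whereas UTF-8 has only one serialization.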

Finally, in terms of nudging developers to write correct code, UTF-8 has the benefit of being blatantly variable-width, so even with languages such as English, Somali, and Swahili, as soon as you have a dash or a smart quote, the variable-width nature of UTF-8 shows up. In this context, extended grapheme clusters are just extending the variable-width nature. Meanwhile, UTF-16 allows programmers to get too far while pretending to be working with something where the units they need to care about are fixed-width. Reacting to surrogate pairs by wishing to use UTF-32 instead is a bad idea, because if you want to write correct software, you still need to deal with variable-width extended grapheme clusters.
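A tiny Python sketch of the smart quote case shows how early the variable-width nature of UTF-8 surfaces compared to UTF-16. The helper name is invented for the example.

```python
ascii_only = "Don't"
with_smart_quote = "Don\u2019t"  # U+2019 RIGHT SINGLE QUOTATION MARK

def units(s: str):
    # (UTF-8 code units, UTF-16 code units, scalar values)
    return (len(s.encode("utf-8")), len(s.encode("utf-16-le")) // 2, len(s))

print(units(ascii_only))        # (5, 5, 5)
print(units(with_smart_quote))  # (7, 5, 5)
```

With UTF-16, the code unit count still equals the scalar value count here, so the fixed-width illusion survives until the first astral character or combining sequence shows up.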

The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing. The choice of UTF-16 is a matter of early-adopter legacy from the time when Unicode was expected to be capped to 16 bits of code space and, once UTF-16 has been committed to, not breaking compatibility with already-written programs is important and justified the continued use of UTF-16, but if you aren’t bound by that legacy and are designing a new language, you should go with UTF-8. Occasionally even systems that appear to be bound by the UTF-16 legacy can break free. Even though Swift is committed to interoperability with Cocoa, which uses UTF-16 strings, Swift 5 switched to UTF-8 for Swift-native strings. Similarly, PyPy has gone UTF-8 despite Python 3 having code point semantics.

Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?

Even if we accept that the storage should be UTF-8 and that the string implementation should maintain knowledge of the string length in UTF-8 code units, if the blatant variable-widthness of UTF-8 is argued to be a nudge toward dealing with the variable-widthness of extended grapheme clusters, shouldn’t the Swift approach of making extended grapheme cluster access and count the view that takes the least ceremony to use be the thing that every language should do?

Swift is still too young to draw definitive conclusions from. It’s easy to believe that the Swift approach nudges programmers to write more extended grapheme cluster-correct code and that the design makes sense for a language meant primarily for UI programming on a largely evergreen platform (iOS). It isn’t clear, though, that the Swift approach is the best for everyone.

Earlier, I said that the example used “Swift 4.2.3 on Ubuntu 18.04”. The “18.04” part is important! Swift.org ships binaries for Ubuntu 14.04, 16.04, and 18.04. Running the program

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)

in Swift 4.2.3 on Ubuntu 14.04 prints:

3
5
7
17

So Swift 4.2.3 on Ubuntu 18.04 as well as the unic_segment 0.9.0 Rust crate counted one extended grapheme cluster, the unicode-segmentation 1.3.0 Rust crate counted two extended grapheme clusters, and the same version of Swift, 4.2.3, but on a different operating system version counted three extended grapheme clusters!

Swift 4 delegates Unicode segmentation to operating system-provided ICU, and “Long-Term Support” in the Ubuntu case means security patches but does not mean rolling forward the Unicode version that the system copy of ICU knows about. In the case of iOS, delegating to system ICU is probably OK and will not lead to too high a probability of the text being from the future from the point of view of the OS copy of ICU, since the iOS ecosystem stays exceptionally well up-to-date. However, delegating to system ICU is not such a great match for the idea of using Swift on the server side if the server side means running an old LTS distro.

(Swift 5 appears to no longer use system ICU for this. That is, Swift 5.0.3 on Ubuntu 14.04 sees one extended grapheme cluster in the string. I haven’t investigated what Swift 5 uses, but I assume that the switch to UTF-8 string representation necessitated using something other than ICU, which is heavily UTF-16-oriented. However, the result with Swift 4.2.3 nicely illustrates the issue related to using extended grapheme clusters.)

If you are doing things that have to be extended grapheme cluster-aware, there just is no way around the issue of not being able to correctly segment text that comes from the future relative to the Unicode segmentation implementation that your program is using. This is not a reason to avoid extended grapheme clusters for tasks that require awareness of extended grapheme clusters.

However, pushing extended grapheme clusters onto tasks that do not really require the use of extended grapheme clusters introduces failure modes arising from the Unicode version dependency where such a dependency isn’t strictly necessary. For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates.

Let’s consider other languages a bit.

C++ is often deployed such that the application developer doesn’t ship the standard library with the program. Most obviously, relying on GNU libstdc++ provided by an LTS Linux distribution presents problems similar to those of Swift 4 relying on ICU provided by an LTS Linux distribution. This isn’t a Linux-specific issue. Old supported branches of Windows generally don’t get new system-level Unicode data, either. Even though there is some movement towards individual applications shipping their own copy of LLVM libc++ with the application and the increased pace of C++ standard development starting with C++11 has made using a system-provided C++ standard library more problematic even ignoring Unicode considerations, it doesn’t seem like a good idea for C++ to develop a tight coupling with extended grapheme clusters for operations that don’t strictly necessitate it as long as stuck-in-the-past system libraries (whether the C++ standard library itself or another library that it delegates to) are a significant part of the C++ standard library distribution practice.

There’s a proposal to expose extended grapheme cluster segmentation to JavaScript programs. The main problem with this proposal is the implication on APK sizes on Android and the effect of APK sizes on browser competition on Android. But if we ignore that for the moment and imagine this was part of the Web Platform, it would still be problematic to build this dependency into operations for which working on extended grapheme clusters isn’t strictly necessary. While the most popular browsers are evergreen, there’s still a long tail of browser instances that aren’t on the latest engine versions. When JavaScript executes on such browsers, there’d be effects similar to running Swift 4 on Ubuntu 14.04.

In contrast to C++ or JavaScript, the current Rust approach is to statically link all Rust library code, including the standard library, into the executable program. This means that the application distributor is in control of library versions and doesn’t need to worry about the program executing in the context of out-of-date Rust libraries. The flip side is concerns about the size of the executable. People already (rightly or wrongly) complain about the sizes of Rust executables. Pulling in a lot of Unicode data due to baking extended grapheme cluster processing into programs whose problem domain doesn’t strictly require working with extended grapheme clusters would be problematic in embedded contexts where the executable size is a real problem and not just a perceived problem—and would obviously make the perceived problem worse, too. Furthermore, in order to avoid problems similar to those involved in relying on system libraries, baking tight coupling with Unicode data into the standard library necessitates the organizational capability of keeping up with new Unicode versions in this area where not only data in the tables keeps changing but the format of the tables and, therefore, the associated algorithms have still been changing recently. Right now of the two extended grapheme cluster crates outside the Rust standard library, the one that’s organizationally closer to the standard library is the one that’s out of date.

Why Do We Want to Know Anyway?

“String length is about as meaningful a measurement as string height” – @qntm

Being able to allocate memory for strings gives a legitimate use case for knowing the storage length. However, in cases of Unicode scalar values or extended grapheme clusters, you typically want to iterate over them and look at each one instead of just knowing the count. So why do people want to know the count? As far as I can tell, there are two broad categories: Placing a quota limit that is fuzzy enough that it doesn’t need to be strictly tied to storage and trying to estimate how much text fits for display. Let’s look at the issue of estimating how much display space text takes, because it involves introducing yet another measurement of string length.

Display Space

Simply looking at the Latin letters i and m should make it clear that the display size of a string depends on the font and on the specific characters in the string. From this observation, the whole notion of estimating display space by counting characters seems folly. Indeed, if you want to know exactly how much text fits into a given space, you need to run a typesetting algorithm with a specific font, which may have a complex relationship between scalar values and glyphs, to actually see where the overflow starts. Yet, even in the case of the Latin script that has letters such as i and m, e.g. magazine editors can find character counts useful enough for estimating how many print pages an article of a given character count length is going to fill.

As for computer user interfaces, character terminal user interfaces use a monospaced font where both i and m take up one character cell on a grid. In the context of a monospaced font, the extended grapheme cluster count in the context of the Latin script corresponds directly to display space taken. The same obviously applies to the Greek and Cyrillic scripts, which are so close to the Latin script that fonts even intend to reuse glyphs across these scripts. In contrast, CJK ideographs, Japanese kana, and Hangul syllables take two cells of a terminal grid. From the CJK perspective, these are full-width characters and the ASCII characters are half-width characters. There also exist half-width katakana characters, which fit into an 8-bit encoding with ASCII and take one cell on the terminal grid and, therefore, are technically easier to fit to Latin script-oriented terminal systems. The display width on a terminal also corresponds to the byte count in the legacy CJK encodings: ASCII takes one byte, while a CJK ideograph, a full-width kana, or a Hangul syllable takes two bytes. In the case of Shift_JIS, half-width katakana takes one byte per character.

This brings us to the concept of East Asian Width. ASCII and half-width katakana characters are narrow. CJK ideographs, full-width kana, and Hangul syllables are wide. However, even in the worldview that is split into Latin, Greek, and Cyrillic on one hand and Chinese, Japanese, and Korean on the other hand, there are ambiguities. From the perspective of European legacy encodings, Greek and Cyrillic (as well as accented Latin) are as wide as ASCII. However, in legacy CJK encodings, Greek and Cyrillic characters take two bytes. This means that in terms of East Asian Width, a string can have a general-purpose width, which resolves these ambiguous characters as narrow, or legacy CJK-context width, which resolves these ambiguous characters as wide.
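The two resolutions can be sketched in Rust with the standard library alone. This is only an illustration: the ranges are a hand-picked subset chosen to cover the examples in the text, not the real East Asian Width data from the Unicode Character Database, and the names `Eaw`, `classify`, and `width` are made up for this sketch.

```rust
/// Simplified East Asian Width classes (the real property has more values).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Eaw {
    Narrow,
    Wide,
    Ambiguous,
}

/// Hand-picked ranges for illustration only; NOT the real Unicode tables.
fn classify(c: char) -> Eaw {
    match c as u32 {
        // Hiragana and full-width katakana
        0x3040..=0x30FF => Eaw::Wide,
        // CJK Unified Ideographs
        0x4E00..=0x9FFF => Eaw::Wide,
        // Precomposed Hangul syllables
        0xAC00..=0xD7A3 => Eaw::Wide,
        // Basic Greek letters: two bytes in legacy CJK encodings
        0x0391..=0x03A1 | 0x03A3..=0x03A9 | 0x03B1..=0x03C1 | 0x03C3..=0x03C9 => Eaw::Ambiguous,
        // Basic Cyrillic letters: likewise
        0x0410..=0x044F => Eaw::Ambiguous,
        // ASCII, half-width katakana, and everything this sketch doesn't know
        _ => Eaw::Narrow,
    }
}

/// Terminal cells for a string, resolving ambiguous characters as narrow
/// (general-purpose) or as wide (legacy CJK context).
fn width(s: &str, cjk_context: bool) -> usize {
    s.chars()
        .map(|c| match classify(c) {
            Eaw::Narrow => 1,
            Eaw::Wide => 2,
            Eaw::Ambiguous => if cjk_context { 2 } else { 1 },
        })
        .sum()
}

fn main() {
    for s in ["im", "日本語", "한국어", "αβγ", "мир"] {
        println!("{}: general {}, CJK context {}", s, width(s, false), width(s, true));
    }
}
```

Only the Greek and Cyrillic samples change between the two columns, which is the ambiguity described above.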

So is the general-purpose variant (that resolves Greek and Cyrillic characters as narrow) of East Asian Width the one true string length measure? Well, no.

First of all, the concept ignores all scripts that are geographically and in Unicode order between Latin, Greek, and Cyrillic on one hand and CJK on the other (even though some other scripts that are structurally similar to the Latin, Greek, and Cyrillic scripts and make sense for a monospaced font, such as the Armenian and Georgian scripts, fit this concept, too, despite not having a history in a pre-Unicode CJK context). As it happens, though, emoji do fit into the concept, except for weird errors in the Unicode database. After all, emoji originate from Japan and were two bytes each when represented using the private use area of Shift_JIS.

Second, the concept assumes that there is a one-to-one correspondence between scalar values and extended grapheme clusters. If we run this Rust program:

use unicode_width::UnicodeWidthStr;

fn main() {
    println!("{}", "🤦🏼‍♂️".width());
}

it prints:

5

This is because the base emoji is wide (2), the combining skin tone modifier is also wide (2), the male sign is counted as narrow (1), and the zero-width joiner and the variation selector are treated as control characters that don’t count towards width. Obviously, this is not the answer that we want. The answer we want is 2. Ideas that come to mind immediately, such as only counting the width of the first character in an extended grapheme cluster or taking the width of the widest character in an extended grapheme cluster, don’t work, because flag emoji consist of two regional indicator symbol letter characters both of which have East Asian Width of Neutral (i.e. they are counted as narrow but are not marked as narrow, because they are considered to exist outside the domain of East Asian typography). I’m not aware of any official Unicode definition that would reliably return 2 as the width of every kind of emoji. 😭
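To make the arithmetic concrete, here is a small standard-library-only sketch of the three heuristics mentioned above. The per-scalar widths are hard-coded from the description in the preceding paragraph for just the characters in the two examples. They are an assumption standing in for a real width table, and each input string is assumed to be exactly one extended grapheme cluster. The function names are made up for this sketch.

```rust
/// Per-scalar widths as described in the text, hard-coded for the handful
/// of characters used below. Not a real width table.
fn scalar_width(c: char) -> usize {
    match c as u32 {
        0x200D | 0xFE0F => 0,     // ZERO WIDTH JOINER, VARIATION SELECTOR-16
        0x1F926 | 0x1F3FC => 2,   // FACE PALM and the skin tone modifier: wide
        0x1F1E6..=0x1F1FF => 1,   // regional indicators: Neutral, counted as narrow
        _ => 1,                   // MALE SIGN and everything else: narrow
    }
}

/// Sum of the widths of all scalar values in the cluster.
fn sum_of_widths(cluster: &str) -> usize {
    cluster.chars().map(scalar_width).sum()
}

/// Width of the first scalar value only.
fn first_width(cluster: &str) -> usize {
    cluster.chars().next().map_or(0, scalar_width)
}

/// Width of the widest scalar value in the cluster.
fn widest(cluster: &str) -> usize {
    cluster.chars().map(scalar_width).max().unwrap_or(0)
}

fn main() {
    let face_palm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
    let flag = "\u{1F1EB}\u{1F1EE}"; // regional indicators F and I
    for (name, s) in [("face palm", face_palm), ("flag", flag)] {
        println!(
            "{}: sum {}, first {}, widest {}",
            name,
            sum_of_widths(s),
            first_width(s),
            widest(s)
        );
    }
}
```

With these assumed widths, summing overshoots for the face palm (2 + 2 + 0 + 1 + 0), while the first-character and widest-character heuristics undershoot for the flag, so none of the three yields 2 for both.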

If you really must estimate display size without running text layout with a font, whether the extended grapheme cluster count or the East Asian Width of the string works better depends on context.

Arbitrary but Fair Quotas

In some cases there is a desire to impose a length limit that doesn’t arise from a strict storage limitation. For example, in the STUN protocol given earlier, presumably there is a desire to make it so that human-readable error messages cannot make protocol messages arbitrarily long. For example, in the case of Twitter, tweets being short is a core part of the type of expression that Twitter is about, so some definition of “short” is needed. In the case of string-based browser localStorage, there is a need to have some limit, but the limit is necessarily arbitrary and does not need to strictly map to bytes on disk.

In cases like this, there seems to be some concern that the limit should be internationally fair. Observations that UTF-8 and UTF-16 take a different amount of storage per character depending on the character superficially suggest that the UTF-8 length or the UTF-16 length might be unfair internationally.

What’s fair, though? The usual concern goes that UTF-8 favors English, because English takes one byte per character, and disfavors CJK, because Chinese, Japanese, and Korean take three bytes per character, so UTF-8 is unfair to CJK. This kind of analysis ignores how much information is conveyed per character. To assess what lengths we get for different languages when the amount of information conveyed is kept constant, I looked at the counts for the translations of the Universal Declaration of Human Rights. This is a document for which translation of the same content is available in particularly many languages, which is why I used it as the measurement corpus.
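As a toy illustration of why bytes per character is the wrong thing to compare, the following sketch measures the same two-word concept in English and in traditional-character Chinese. The phrase pair is my own illustrative choice and not part of the measured corpus, and `lengths` is a made-up helper name.

```rust
/// (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for a string.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    for s in ["human rights", "人權"] {
        let (utf8, utf16, scalars) = lengths(s);
        println!(
            "{}: {} UTF-8 bytes, {} UTF-16 code units, {} scalar values",
            s, utf8, utf16, scalars
        );
    }
}
```

The Chinese phrase takes three bytes per character in UTF-8 and is still half the UTF-8 length of the English one. A single phrase proves nothing, of course, which is why a corpus is needed.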

Unfortunately, not all translations contain the same text, so one needs to be careful when preparing the data for comparison. Some translations are incomplete, in some cases, very incomplete. For this reason, I included only translations in stage 4 or stage 5 along the 5-stage scale. Some translations carry the preamble with the recitals, but some do not. Some also carry historical notes. To make the length comparable, the preamble, notes, and whitespace-only text nodes were omitted. The rest of the XML text nodes were concatenated and normalized to Unicode Normalization Form C before counting. (Source code is available.)

Let’s look at the result. The table at the end of this document is sortable and is initially sorted by UTF-8 length. Each Δ% column shows how much the count in the column to its left deviates from the median count for that measure. (A note about color-coding: coloring longer than median as red should not be taken to imply that those languages are somehow bad. It’s meant to imply that a length quota treats those languages badly.) In the table, the name of each language links to the translation in that language hosted on the site of the Unicode Consortium. The linked HTML versions may include the preamble and/or notes.
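For readers who want to reproduce a Δ% column from raw counts, the computation can be sketched as follows. The sample counts are hypothetical and not taken from the table. Taking the mean of the two middle values for an even number of items is the usual convention and is an assumption here, as are the helper names `median` and `delta_percent`.

```rust
/// Median of a list of counts; for an even number of items, the mean of
/// the two middle ones (an assumed convention).
fn median(counts: &[u32]) -> f64 {
    assert!(!counts.is_empty(), "median of an empty list is undefined");
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    if n % 2 == 1 {
        f64::from(sorted[n / 2])
    } else {
        (f64::from(sorted[n / 2 - 1]) + f64::from(sorted[n / 2])) / 2.0
    }
}

/// Deviation from the median in percent, rounded to one decimal place.
fn delta_percent(count: u32, median: f64) -> f64 {
    ((f64::from(count) - median) / median * 1000.0).round() / 10.0
}

fn main() {
    // Hypothetical counts, not taken from the table.
    let counts: [u32; 4] = [4000, 9000, 10000, 12000];
    let m = median(&counts);
    for c in counts {
        println!("{} {:.1}", c, delta_percent(c, m));
    }
}
```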

The CJK concern is alleviated when considering information conveyed. When measuring UTF-8 length, Mandarin using traditional characters is the shortest of the languages that have global name recognition! This should be expected, since the Han script pretty obviously packs more information per character than e.g. alphabetic scripts. (The globally less-known languages whose UTF-8 length is shorter than Mandarin’s (using traditional characters) are African and American Latin-script languages with a relatively small native speaker population for each—only one with a native speaker population exceeding a million and many whose native speaker population is smaller than 100 000, which explains why you might not recognize their names.)

Korean is also shorter than median in UTF-8 length. This also makes sense, since Hangul syllables pack three or two alphabetic jamo into one three-byte character. The UTF-8 length of Japanese is over median but only by 4.1%. The Japanese version of the text is 48% kanji and 52% hiragana. Japanese Wikipedia has almost the same kana to kanji ratio, though different kana: 46% kanji and the rest almost evenly split between hiragana and katakana, so we may assume the Universal Declaration of Human Rights to be representative of Japanese text in terms of kana to kanji ratio.

When sorting by UTF-16 code unit count, UTF-32 / scalar value count, or extended grapheme cluster count, CJK are the shortest. While it’s true that UTF-8 takes more bytes for CJK than UTF-16, the notion of UTF-8 being particularly unfavorable to CJK is not true relative to other languages. Rather, UTF-16 is particularly favorable to CJK. In particular, the Han script is so information-dense that even when sorting by East Asian Width, which effectively doubles the length of CJK but not other languages, Han-script languages stay clustered at the start of the table. Korean and Japanese move further but remain below median.

The language with the longest UTF-8 length is Shan, which uses the Burmese script. The Burmese language, also using the Burmese script, is the second-longest in UTF-8 length. There are a number of other Brahmic-script languages among the ones with the longest UTF-8 length. They use three bytes per character but don’t have CJK-like information density per character. These languages are below median in extended grapheme cluster count. In scalar value count, they intermingle with alphabetic languages.

It’s not clear if the concepts of median and mean (average) are meaningful. Does it make sense for a language with tens of millions of native speakers to count as an equal data point as a language with tens of thousands of native speakers? Since this is about writing, should the numbers of writers be considered instead? (I.e. should literacy rates be taken into account?) In the hope that with a large number of languages in the table, median hand-wavily sorts out this kind of issue, I chose to compare with median. At least the Han-script languages have comparable numbers of native speakers as the Brahmic-script languages and provide a counter-weight at the other end of the spectrum of UTF-8 length. In any case, for measures other than UTF-8 length, median and mean are very close to each other.

Saying that Brahmic-script languages intermingle with alphabetic languages in character count is rather meaningless, though. In character count, after CJK (and Han-script Vietnamese and Yi-script Nuosu), the language with the smallest character count is a Latin-script language (Waama). Also, the language with the largest character count is a Latin-script language (Ashéninka, Pichis). (I find it odd that in UTF-8 length Ashéninka Perené is the second-shortest but Ashéninka, Pichis is long enough to reach the Brahmic cluster. I don’t know what the relation of these two languages is and what explains two languages whose name suggests close relation ending up in opposite extremes in length. Update: It has been pointed out to me that the supposed Ashéninka Perené translation is a mislabeled duplicate of the Cashinahua translation.)

One might hypothesize that the Latin script has just been put to so many uses that some of the uses have to be far from what it has been optimized for. Yet, when considering language-specific alphabets, the character counts for Greek and Georgian are above median. It just is the case that languages are different. In that sense, the whole notion of trying to find a simple length measure that is fair across languages seems folly.

Let’s look at the factor between the minimum and maximum of each measure, i.e. the factor with which the minimum needs to be multiplied to get the maximum. Let’s even ignore the outlier for maximum for each measure and use the second largest value instead of the largest value for each count. (Otherwise, Ashéninka, Pichis alone would skew the numbers a lot.) We get these factors:

UTF-8 8.6
UTF-16 7.9
UTF-32 7.9
EGC 7.9
EAW 4.3
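
The factor computation itself is simple enough to sketch. The counts below are hypothetical and not taken from the table, and `spread_factor` is a name made up for this sketch.

```rust
/// Factor from the minimum to the second-largest value, i.e. the spread
/// with the single largest outlier ignored. Returns None for fewer than
/// two values or a zero minimum.
fn spread_factor(counts: &[u32]) -> Option<f64> {
    if counts.len() < 2 {
        return None;
    }
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let min = sorted[0];
    let second_largest = sorted[sorted.len() - 2];
    if min == 0 {
        return None;
    }
    Some(f64::from(second_largest) / f64::from(min))
}

fn main() {
    // Hypothetical counts, not taken from the table.
    let counts: [u32; 5] = [4000, 8000, 10000, 20000, 90000];
    if let Some(factor) = spread_factor(&counts) {
        println!("{:.1}", factor);
    }
}
```

In this made-up sample, the largest value would give a factor of 22.5, while ignoring it gives 5.0, which shows how much a single outlier can dominate the measure.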

UTF-16, UTF-32, and extended grapheme clusters aren’t distinguished by this measure, because the languages at the extremes use characters from the Basic Multilingual Plane with one character per grapheme cluster. Considering that there are supplementary-plane scripts, arguably the UTF-32 count would be fairer than the UTF-16 count even though this factor doesn’t show the difference. It’s not clear that counting extended grapheme clusters would be particularly fair compared to counting characters: It favors scripts that are visually joining over scripts that aren’t visually joining even if there’s no logical difference. While looking at just the factor, East Asian Width makes the gap the smallest, but it’s a rather imprecise fairness solution. It just counts CJK as double. Even after this, the Han-script languages are still among the ones with the smallest counts. On the other hand, it seems unfair to recognize Hangul syllables and kana as carrying more information than an alphabetic character while not giving the same treatment to other syllabaries, such as the Ethiopic script, Ge’ez.

Twitter counts each CJK character (including three-jamo Hangul syllables; i.e. it is not decomposing Hangul and treating it as alphabetic) as consuming 2 units of the quota (as when counting East Asian Width), counts emoji as consuming two units (even when East Asian Width of the cluster would be more), and, unlike East Asian Width, counts each Ethiopic syllable as consuming two units of the quota. What Twitter does seems fairer than just applying East Asian Width, but the result is still that the amount of information that can be packed in a tweet can vary four-fold depending on language. That still doesn’t seem exactly fair across languages.
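
A weighted quota of this general shape can be sketched as follows. To be clear, this is not Twitter's actual algorithm or its actual range tables: the ranges are hand-picked to match the categories named above, emoji handling is left out because it requires extended grapheme cluster segmentation, and the names `weight` and `quota_units` are made up.

```rust
/// Quota units consumed by one scalar value under a simplified rule:
/// CJK and Ethiopic count double, everything else counts once.
/// The ranges are illustrative, not Twitter's real configuration.
fn weight(c: char) -> usize {
    match c as u32 {
        0x1200..=0x137F => 2, // Ethiopic
        0x3040..=0x30FF => 2, // hiragana and katakana
        0x4E00..=0x9FFF => 2, // CJK Unified Ideographs
        0xAC00..=0xD7A3 => 2, // precomposed Hangul syllables (not decomposed)
        _ => 1,
    }
}

/// Total quota units for a string (emoji clusters are not special-cased).
fn quota_units(s: &str) -> usize {
    s.chars().map(weight).sum()
}

fn main() {
    for s in ["tweet", "人權", "한", "\u{1200}"] {
        println!("{}: {} units", s, quota_units(s));
    }
}
```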

In closing:

Name UTF-8 Δ% UTF-16 Δ% UTF-32 Δ% EGC Δ% EAW Δ% Script
Cashinahua 4170 -57.6 4135 -53.0 4135 -52.9 4135 -52.3 4135 -52.4 Latn
Ashéninka Perené 4170 -57.6 4135 -53.0 4135 -52.9 4135 -52.3 4135 -52.4 Latn
Waama 4293 -56.3 4011 -54.4 4011 -54.4 4007 -53.8 4007 -53.9 Latn
Chickasaw 4850 -50.6 4685 -46.7 4685 -46.7 4587 -47.1 4587 -47.2 Latn
Bulu 4919 -49.9 4808 -45.3 4808 -45.3 4808 -44.5 4808 -44.7 Latn
Kulango, Bouna 5286 -46.2 4164 -52.6 4164 -52.6 4164 -51.9 4164 -52.1 Latn
Zapotec, Miahuatlán 5464 -44.4 5433 -38.2 5433 -38.2 5433 -37.3 5433 -37.5 Latn
Nyamwezi 5750 -41.5 5686 -35.3 5686 -35.3 5686 -34.4 5686 -34.6 Latn
Kaonde 5972 -39.2 5972 -32.1 5972 -32.0 5972 -31.1 5972 -31.3 Latn
Mixtec, Metlatónoc 6023 -38.7 5630 -36.0 5630 -35.9 5611 -35.2 5611 -35.4 Latn
Makonde 6100 -37.9 5946 -32.4 5946 -32.3 5946 -31.4 5946 -31.6 Latn
Sharanahua 6165 -37.3 6162 -29.9 6162 -29.9 6162 -28.9 6162 -29.1 Latn
Serer-Sine 6166 -37.3 6079 -30.9 6079 -30.8 6079 -29.8 6079 -30.0 Latn
Dinka, Northeastern 6214 -36.8 5815 -33.9 5815 -33.8 5775 -33.4 5775 -33.5 Latn
Okiek 6272 -36.2 6272 -28.7 6272 -28.6 6272 -27.6 6271 -27.8 Latn
Jola-Fonyi 6299 -35.9 6122 -30.4 6122 -30.3 6122 -29.3 6122 -29.5 Latn
Maninkakan, Eastern 6372 -35.2 5867 -33.3 5867 -33.2 5867 -32.3 5867 -32.5 Latn
Chinantec, Ojitlán 6463 -34.2 5957 -32.3 5957 -32.2 5957 -31.3 5957 -31.4 Latn
Soninke 6496 -33.9 6430 -26.9 6430 -26.8 6430 -25.8 6430 -26.0 Latn
Chokwe (Angola) 6596 -32.9 6565 -25.3 6565 -25.3 6565 -24.2 6565 -24.4 Latn
Chinese, Mandarin (Traditional) 6606 -32.8 2202 -75.0 2202 -74.9 2202 -74.6 4404 -49.3 Hant
Otomi, Mezquital 6614 -32.7 6438 -26.8 6438 -26.7 6379 -26.4 6379 -26.6 Latn
Chinese, Mandarin (Simplified) 6708 -31.7 2278 -74.1 2278 -74.1 2278 -73.7 4493 -48.3 Hans
Quechua (Unified Quichua, old Hispanic orthography) 6713 -31.7 6670 -24.2 6670 -24.1 6670 -23.0 6670 -23.2 Latn
Shilluk 6798 -30.8 6036 -31.4 6036 -31.3 6036 -30.3 6036 -30.5 Latn
Colorado 6798 -30.8 6797 -22.7 6797 -22.7 6796 -21.6 6794 -21.8 Latn
Dendi 6823 -30.6 6327 -28.1 6327 -28.0 6325 -27.0 6325 -27.2 Latn
Chinese, Jinyu 6848 -30.3 2284 -74.0 2284 -74.0 2284 -73.6 4566 -47.4 Hans
Chinese, Min Nan 6887 -29.9 2297 -73.9 2297 -73.9 2297 -73.5 4592 -47.1 Hans
Chinese, Gan 6889 -29.9 2297 -73.9 2297 -73.9 2297 -73.5 4593 -47.1 Hans
Vietnamese (Han nom) 6910 -29.7 2564 -70.8 2224 -74.7 2224 -74.3 4397 -49.4 Hani
Chinese, Hakka 6929 -29.5 2311 -73.7 2311 -73.7 2311 -73.3 4620 -46.8 Hans
Lunda 6968 -29.1 6968 -20.8 6968 -20.7 6968 -19.6 6968 -19.8 Latn
Chinese, Yue 6973 -29.0 2325 -73.6 2325 -73.5 2325 -73.2 4648 -46.5 Hani
Pular 6991 -28.9 6991 -20.5 6991 -20.4 6991 -19.3 6991 -19.5 Latn
Limba, West-Central 7007 -28.7 6257 -28.8 6257 -28.8 6257 -27.8 6257 -28.0 Latn
Naga, Ao 7019 -28.6 6729 -23.5 6729 -23.4 6729 -22.3 6729 -22.5 Latn
Mazahua Central 7052 -28.2 6750 -23.2 6750 -23.2 6517 -24.8 6517 -25.0 Latn
Chinese, Wu 7082 -27.9 2362 -73.1 2362 -73.1 2362 -72.7 4722 -45.6 Hans
Kpelle, Guinea 7139 -27.4 6136 -30.2 6136 -30.2 6136 -29.2 6136 -29.4 Latn
Amis 7206 -26.7 7206 -18.1 7206 -18.0 7206 -16.8 7206 -17.1 Latn
Baatonum 7255 -26.2 6788 -22.8 6788 -22.8 6779 -21.8 6779 -22.0 Latn
Tetun 7280 -25.9 7280 -17.2 7280 -17.2 7280 -16.0 7280 -16.2 Latn
Chinantec, Chiltepec 7304 -25.7 6468 -26.4 6468 -26.4 6262 -27.7 6262 -27.9 Latn
(Maiunan) 7312 -25.6 7312 -16.9 7312 -16.8 7312 -15.6 7312 -15.8 Latn
Tetun Dili 7357 -25.1 7225 -17.8 7225 -17.8 7225 -16.6 7225 -16.8 Latn
(Minjiang, written) 7366 -25.0 7363 -16.3 7363 -16.2 7363 -15.0 7363 -15.3 Latn
Quechua, Cusco 7369 -25.0 7309 -16.9 7309 -16.8 7309 -15.6 7309 -15.9 Latn
(Mijisa) 7393 -24.8 7393 -15.9 7393 -15.9 7393 -14.7 7392 -14.9 Latn
Drung 7412 -24.6 7412 -15.7 7412 -15.7 7412 -14.5 7412 -14.7 Latn
Mazatec, Ixcatlán 7442 -24.3 7261 -17.4 7261 -17.4 7261 -16.2 7261 -16.4 Latn
Rwanda 7456 -24.1 7456 -15.2 7456 -15.2 7456 -14.0 7456 -14.2 Latn
(Minjiang, spoken) 7512 -23.6 7509 -14.6 7509 -14.6 7509 -13.3 7509 -13.6 Latn
Sukuma 7532 -23.4 7452 -15.3 7452 -15.2 7452 -14.0 7452 -14.2 Latn
Makhuwa 7562 -23.0 7398 -15.9 7398 -15.8 7398 -14.6 7398 -14.8 Latn
Aymara, Central 7568 -23.0 7363 -16.3 7363 -16.2 7363 -15.0 7363 -15.3 Latn
Ido 7580 -22.9 7580 -13.8 7580 -13.7 7580 -12.5 7580 -12.8 Latn
Záparo 7591 -22.8 7583 -13.8 7583 -13.7 7583 -12.5 7583 -12.7 Latn
Bamanankan 7597 -22.7 6890 -21.7 6890 -21.6 6890 -20.5 6890 -20.7 Latn
Nyankore 7628 -22.4 7628 -13.3 7628 -13.2 7628 -12.0 7628 -12.2 Latn
Ndebele 7659 -22.1 7659 -12.9 7659 -12.8 7659 -11.6 7659 -11.8 Latn
Sãotomense 7712 -21.5 6956 -20.9 6956 -20.8 6956 -19.7 6956 -19.9 Latn
Pijin 7716 -21.5 7716 -12.3 7716 -12.2 7716 -11.0 7716 -11.2 Latn
Latin 7747 -21.2 7747 -11.9 7747 -11.8 7747 -10.6 7747 -10.8 Latn
Susu 7757 -21.1 7310 -16.9 7310 -16.8 7310 -15.6 7310 -15.9 Latn
Oroqen 7768 -21.0 7761 -11.7 7761 -11.7 7761 -10.4 7761 -10.7 Latn
Lozi 7825 -20.4 7825 -11.0 7825 -11.0 7825 -9.7 7825 -9.9 Latn
Latin (1) 7869 -19.9 7869 -10.5 7869 -10.5 7869 -9.2 7869 -9.4 Latn
Otuho 7890 -19.7 7890 -10.3 7890 -10.2 7801 -10.0 7712 -11.2 Latn
Achuar-Shiwiar (1) 7893 -19.7 7842 -10.8 7842 -10.8 7785 -10.2 7728 -11.0 Latn
Huastec (Veracruz) 7911 -19.5 7882 -10.4 7882 -10.3 7882 -9.0 7882 -9.3 Latn
Umbundu (011) 7941 -19.2 7910 -10.1 7910 -10.0 7910 -8.7 7910 -9.0 Latn
Nuosu 7953 -19.1 2663 -69.7 2663 -69.7 2663 -69.3 5308 -38.9 Yiii
Even 7969 -18.9 4320 -50.9 4320 -50.8 4320 -50.1 4320 -50.3 Cyrl
Q'eqchi' 7981 -18.8 7981 -9.2 7981 -9.2 7981 -7.9 7981 -8.1 Latn
Moba 7985 -18.7 7726 -12.1 7726 -12.1 7726 -10.8 7726 -11.1 Latn
Mam, Northern 7994 -18.7 7994 -9.1 7994 -9.0 7994 -7.7 7994 -8.0 Latn
Kabiyé 7997 -18.6 6193 -29.6 6193 -29.5 6193 -28.5 6193 -28.7 Latn
Kanuri, Central 8077 -17.8 7621 -13.3 7621 -13.3 7621 -12.0 7621 -12.3 Latn
Esperanto 8095 -17.6 7930 -9.8 7930 -9.8 7930 -8.5 7930 -8.7 Latn
Serbian (Latin) 8102 -17.6 7876 -10.4 7876 -10.4 7876 -9.1 7876 -9.3 Latn
Urarina 8127 -17.3 8125 -7.6 8125 -7.5 8125 -6.2 8125 -6.5 Latn
Kurdish, Central 8163 -16.9 7462 -15.1 7462 -15.1 7462 -13.9 7462 -14.1 Latn
Kurdish, Northern 8163 -16.9 7462 -15.1 7462 -15.1 7462 -13.9 7462 -14.1 Latn
Huitoto, Murui 8179 -16.8 7523 -14.5 7523 -14.4 7523 -13.2 7523 -13.4 Latn
Croatian 8201 -16.5 7996 -9.1 7996 -9.0 7996 -7.7 7996 -8.0 Latn
Bemba 8206 -16.5 8206 -6.7 8206 -6.6 8206 -5.3 8206 -5.5 Latn
Waorani 8209 -16.5 8137 -7.5 8137 -7.4 8052 -7.1 7967 -8.3 Latn
Gonja 8215 -16.4 7579 -13.8 7579 -13.8 7579 -12.5 7579 -12.8 Latn
Scots 8224 -16.3 8224 -6.5 8224 -6.4 8224 -5.1 8224 -5.3 Latn
Ndonga 8239 -16.2 8239 -6.3 8239 -6.2 8239 -4.9 8239 -5.2 Latn
Garifuna 8243 -16.1 7721 -12.2 7721 -12.1 7721 -10.9 7721 -11.1 Latn
Bosnian (Latin) 8259 -16.0 8049 -8.5 8049 -8.4 8049 -7.1 8049 -7.4 Latn
Twi (Akuapem) 8264 -15.9 7653 -13.0 7653 -12.9 7653 -11.7 7653 -11.9 Latn
Zulu 8265 -15.9 8261 -6.1 8261 -6.0 8261 -4.7 8261 -4.9 Latn
Guarayu 8280 -15.7 8098 -7.9 8098 -7.9 8098 -6.5 8098 -6.8 Latn
Swahili 8315 -15.4 8315 -5.4 8315 -5.4 8315 -4.0 8315 -4.3 Latn
Zhuang, Yongbei 8318 -15.4 8316 -5.4 8316 -5.4 8316 -4.0 8316 -4.3 Latn
Wolof 8321 -15.3 7940 -9.7 7940 -9.6 7940 -8.4 7940 -8.6 Latn
Zapotec, Güilá 8364 -14.9 8328 -5.3 8328 -5.2 8328 -3.9 8328 -4.1 Latn
Oromo, Borana-Arsi-Guji 8381 -14.7 8381 -4.7 8381 -4.6 8381 -3.3 8381 -3.5 Latn
Welsh 8382 -14.7 8247 -6.2 8247 -6.2 8247 -4.8 8247 -5.1 Latn
Tok Pisin 8399 -14.5 8393 -4.6 8393 -4.5 8393 -3.1 8393 -3.4 Latn
Awa-Cuaiquer 8405 -14.5 8391 -4.6 8391 -4.5 8309 -4.1 8227 -5.3 Latn
Luvale 8411 -14.4 8411 -4.4 8411 -4.3 8411 -2.9 8411 -3.2 Latn
Crioulo, Upper Guinea (008) 8414 -14.4 8225 -6.5 8225 -6.4 8225 -5.1 8225 -5.3 Latn
Afrikaans 8427 -14.2 8365 -4.9 8365 -4.8 8365 -3.5 8365 -3.7 Latn
Faroese 8454 -14.0 7854 -10.7 7854 -10.6 7854 -9.4 7854 -9.6 Latn
Fulfulde, Nigerian 8455 -14.0 8135 -7.5 8135 -7.4 8135 -6.1 8135 -6.4 Latn
Norwegian, Nynorsk 8461 -13.9 8268 -6.0 8268 -5.9 8268 -4.6 8268 -4.8 Latn
Yagua 8468 -13.8 8432 -4.1 8432 -4.1 8432 -2.7 8432 -2.9 Latn
Rundi 8498 -13.5 8498 -3.4 8498 -3.3 8498 -1.9 8498 -2.2 Latn
Norwegian, Bokmål 8500 -13.5 8360 -4.9 8360 -4.9 8360 -3.5 8360 -3.8 Latn
Umbundu 8503 -13.5 8415 -4.3 8415 -4.2 8415 -2.9 8415 -3.1 Latn
English 8565 -12.8 8555 -2.7 8555 -2.7 8555 -1.3 8555 -1.5 Latn
Yao 8574 -12.8 8574 -2.5 8574 -2.4 8574 -1.1 8574 -1.3 Latn
Nomatsiguenga 8575 -12.7 8432 -4.1 8432 -4.1 8432 -2.7 8432 -2.9 Latn
Mapudungun 8585 -12.6 8366 -4.9 8366 -4.8 8366 -3.5 8366 -3.7 Latn
Fijian 8586 -12.6 8584 -2.4 8584 -2.3 8584 -0.9 8584 -1.2 Latn
Tamazight, Central Atlas 8587 -12.6 8226 -6.5 8226 -6.4 8226 -5.1 8226 -5.3 Latn
Nyanja (Chinyanja) 8590 -12.6 8590 -2.3 8590 -2.3 8590 -0.9 8590 -1.1 Latn
Yapese 8635 -12.1 8473 -3.7 8473 -3.6 8473 -2.2 8473 -2.5 Latn
Crioulo, Upper Guinea 8636 -12.1 8632 -1.8 8632 -1.8 8632 -0.4 8632 -0.6 Latn
Secoya 8651 -12.0 8155 -7.3 8155 -7.2 8137 -6.1 8137 -6.3 Latn
Wayuu 8664 -11.8 8077 -8.2 8077 -8.1 8077 -6.8 8077 -7.0 Latn
Lingala 8668 -11.8 8654 -1.6 8654 -1.5 8654 -0.1 8654 -0.4 Latn
Haitian Creole French (Kreyol) 8680 -11.7 8535 -2.9 8535 -2.9 8535 -1.5 8535 -1.8 Latn
Tonga 8685 -11.6 8685 -1.2 8685 -1.2 8685 0.2 8685 -0.0 Latn
Seselwa Creole French 8706 -11.4 8697 -1.1 8697 -1.0 8697 0.4 8697 0.1 Latn
Mende 8707 -11.4 8010 -8.9 8010 -8.9 8010 -7.6 8010 -7.8 Latn
Nyanja (Chechewa) 8725 -11.2 8725 -0.8 8725 -0.7 8725 0.7 8725 0.4 Latn
Hani 8767 -10.8 8767 -0.3 8767 -0.2 8767 1.2 8767 0.9 Latn
Slovenian 8772 -10.7 8520 -3.1 8520 -3.0 8520 -1.7 8520 -1.9 Latn
Hmong, Southern Qiandong 8792 -10.5 8792 -0.0 8792 0.0 8792 1.5 8792 1.2 Latn
Chokwe 8808 -10.4 8808 0.2 8808 0.2 8808 1.7 8808 1.4 Latn
Pipil 8831 -10.1 8825 0.4 8825 0.4 8825 1.8 8825 1.6 Latn
(Bizisa) 8847 -10.0 8847 0.6 8847 0.7 8847 2.1 8847 1.8 Latn
Quechua, Cajamarca 8858 -9.9 8851 0.6 8851 0.7 8851 2.1 8851 1.9 Latn
Kasem 8868 -9.8 8445 -4.0 8445 -3.9 8445 -2.5 8445 -2.8 Latn
Romani, Balkan 8875 -9.7 8606 -2.1 8606 -2.1 8606 -0.7 8606 -0.9 Latn
Turkish 8877 -9.7 8225 -6.5 8225 -6.4 8225 -5.1 8225 -5.3 Latn
Fante 8898 -9.5 8229 -6.4 8229 -6.4 8229 -5.0 8229 -5.3 Latn
Basque 8907 -9.4 8907 1.3 8907 1.4 8907 2.8 8907 2.5 Latn
Ganda 8962 -8.8 8962 1.9 8962 2.0 8962 3.4 8962 3.2 Latn
Occitan 8963 -8.8 8661 -1.5 8661 -1.4 8661 -0.0 8661 -0.3 Latn
Xhosa 8969 -8.7 8881 1.0 8881 1.1 8881 2.5 8881 2.2 Latn
Breton 8982 -8.6 8661 -1.5 8661 -1.4 8661 -0.0 8661 -0.3 Latn
Veps 8985 -8.6 8428 -4.2 8428 -4.1 8428 -2.7 8428 -3.0 Latn
Quechua, Arequipa-La Unión 8988 -8.5 8969 2.0 8969 2.1 8969 3.5 8969 3.2 Latn
Friulian 9003 -8.4 8688 -1.2 8688 -1.1 8688 0.3 8688 0.0 Latn
Swedish 9008 -8.3 8612 -2.1 8612 -2.0 8612 -0.6 8612 -0.9 Latn
Danish 9010 -8.3 8831 0.4 8831 0.5 8831 1.9 8831 1.6 Latn
Aromanian 9020 -8.2 8694 -1.1 8694 -1.1 8694 0.3 8694 0.1 Latn
Madura 9023 -8.2 9023 2.6 9023 2.7 9023 4.1 9023 3.9 Latn
Romani, Balkan (1) 9035 -8.1 8739 -0.6 8739 -0.6 8739 0.9 8739 0.6 Latn
Chayahuita 9065 -7.8 8639 -1.8 8639 -1.7 8639 -0.3 8639 -0.6 Latn
Icelandic 9070 -7.7 8249 -6.2 8249 -6.1 8249 -4.8 8249 -5.1 Latn
Krio 9086 -7.5 8139 -7.4 8139 -7.4 8139 -6.1 8139 -6.3 Latn
Estonian 9093 -7.5 8800 0.1 8800 0.1 8800 1.6 8800 1.3 Latn
Aja 9099 -7.4 8077 -8.2 8077 -8.1 8069 -6.9 8069 -7.1 Latn
Sorbian, Upper 9108 -7.3 8442 -4.0 8442 -3.9 8442 -2.6 8442 -2.8 Latn
Sotho, Southern 9136 -7.0 9136 3.9 9136 4.0 9136 5.4 9136 5.2 Latn
Catalan-Valencian-Balear 9141 -7.0 8823 0.3 8823 0.4 8823 1.8 8823 1.6 Latn
Luba-Kasai 9143 -7.0 9143 4.0 9143 4.0 9143 5.5 9143 5.2 Latn
Minangkabau 9175 -6.6 9167 4.2 9167 4.3 9167 5.8 9167 5.5 Latn
Bari 9178 -6.6 8555 -2.7 8555 -2.7 8555 -1.3 8555 -1.5 Latn
Portuguese (Brazil) 9219 -6.2 8887 1.1 8887 1.1 8887 2.6 8887 2.3 Latn
Huastec (San Luís Potosí) 9222 -6.2 8826 0.4 8826 0.4 8826 1.9 8826 1.6 Latn
Czech 9225 -6.1 8126 -7.6 8126 -7.5 8126 -6.2 8126 -6.5 Latn
Purepecha 9234 -6.0 9082 3.3 9082 3.3 9082 4.8 9082 4.5 Latn
Fon 9244 -5.9 7952 -9.6 7952 -9.5 7943 -8.3 7943 -8.6 Latn
Twi (Asante) 9246 -5.9 8374 -4.8 8374 -4.7 8374 -3.4 8374 -3.6 Latn
Papiamentu 9249 -5.9 9237 5.0 9237 5.1 9237 6.6 9237 6.3 Latn
Slovak 9266 -5.7 8378 -4.7 8378 -4.7 8378 -3.3 8378 -3.6 Latn
Malagasy, Plateau 9272 -5.6 9272 5.4 9272 5.5 9272 7.0 9272 6.7 Latn
Romansch (Vallader) 9300 -5.4 9048 2.9 9048 3.0 9048 4.4 9048 4.1 Latn
Ladin 9324 -5.1 8740 -0.6 8740 -0.5 8740 0.9 8740 0.6 Latn
Mbundu 9327 -5.1 9317 5.9 9317 6.0 9317 7.5 9317 7.2 Latn
Occitan (Auvergnat) 9330 -5.1 8642 -1.7 8642 -1.7 8642 -0.3 8642 -0.5 Latn
Lithuanian 9339 -5.0 8794 0.0 8794 0.1 8794 1.5 8794 1.2 Latn
Ladino 9348 -4.9 9345 6.3 9345 6.3 9345 7.8 9345 7.6 Latn
Mískito 9353 -4.8 9345 6.3 9345 6.3 9345 7.8 9345 7.6 Latn
Assyrian Neo-Aramaic 9363 -4.7 5186 -41.0 5186 -41.0 5127 -40.8 5127 -41.0 Syrc
Waray-Waray 9387 -4.5 9387 6.7 9387 6.8 9387 8.3 9387 8.0 Latn
Korean 9391 -4.4 3856 -56.2 3856 -56.1 3856 -55.5 6623 -23.8 Hang
Somali 9403 -4.3 9403 6.9 9403 7.0 9403 8.5 9403 8.2 Latn
Finnish 9404 -4.3 9023 2.6 9023 2.7 9023 4.1 9023 3.9 Latn
Romansch (Sursilvan) 9421 -4.1 9300 5.8 9300 5.8 9300 7.3 9300 7.0 Latn
Chin, Tedim 9441 -3.9 9431 7.2 9431 7.3 9431 8.8 9431 8.6 Latn
Latvian 9447 -3.9 8582 -2.4 8582 -2.3 8582 -1.0 8582 -1.2 Latn
Romansch (Grischun) 9449 -3.8 9293 5.7 9293 5.7 9293 7.2 9293 7.0 Latn
Gagauz 9451 -3.8 8510 -3.2 8510 -3.2 8510 -1.8 8510 -2.0 Latn
Dagbani 9458 -3.8 8896 1.2 8896 1.2 8896 2.7 8896 2.4 Latn
Finnish, Kven 9464 -3.7 9123 3.7 9123 3.8 9123 5.3 9123 5.0 Latn
Corsican 9475 -3.6 8922 1.5 8922 1.5 8922 3.0 8922 2.7 Latn
Koongo (Angola) 9486 -3.5 9416 7.1 9416 7.1 9416 8.7 9356 7.7 Latn
Ditammari 9487 -3.5 7867 -10.5 7867 -10.5 7748 -10.6 7748 -10.8 Latn
Portuguese (Portugal) 9501 -3.3 9154 4.1 9154 4.2 9154 5.6 9154 5.4 Latn
Manx 9504 -3.3 9440 7.3 9440 7.4 9440 8.9 9440 8.7 Latn
Chamorro 9506 -3.3 9504 8.1 9504 8.1 9504 9.7 9504 9.4 Latn
Galician 9510 -3.2 9223 4.9 9223 4.9 9223 6.4 9223 6.2 Latn
Occitan (Languedocien) 9522 -3.1 9364 6.5 9364 6.6 9364 8.1 9364 7.8 Latn
Romansch (Puter) 9538 -2.9 9303 5.8 9303 5.9 9303 7.4 9303 7.1 Latn
Ligurian 9557 -2.7 8942 1.7 8942 1.8 8942 3.2 8942 2.9 Latn
Quechua, Huaylas Ancash 9563 -2.7 9471 7.7 9471 7.8 9471 9.3 9471 9.0 Latn
Mizo 9576 -2.6 9489 7.9 9489 8.0 9489 9.5 9489 9.2 Latn
Tiv 9585 -2.5 9490 7.9 9490 8.0 9490 9.5 9490 9.2 Latn
Interlingua 9588 -2.4 9588 9.0 9588 9.1 9588 10.7 9588 10.4 Latn
Koongo 9596 -2.4 9596 9.1 9596 9.2 9596 10.7 9596 10.5 Latn
Pohnpeian 9603 -2.3 9603 9.2 9603 9.3 9603 10.8 9603 10.5 Latn
Polish 9613 -2.2 9111 3.6 9111 3.7 9111 5.1 9111 4.9 Latn
Ga 9614 -2.2 8262 -6.0 8262 -6.0 8257 -4.7 8257 -5.0 Latn
Kituba 9630 -2.0 9630 9.5 9630 9.6 9630 11.1 9630 10.8 Latn
Palauan 9654 -1.8 9654 9.8 9654 9.9 9654 11.4 9654 11.1 Latn
Guaraní, Paraguayan 9658 -1.7 8956 1.8 8956 1.9 8956 3.4 8956 3.1 Latn
Frisian, Western 9660 -1.7 9495 8.0 9495 8.0 9495 9.6 9495 9.3 Latn
Albanian, Tosk 9703 -1.3 8972 2.0 8972 2.1 8972 3.5 8972 3.3 Latn
Italian 9739 -0.9 9674 10.0 9674 10.1 9674 11.6 9674 11.3 Latn
Marshallese 9758 -0.7 9758 11.0 9758 11.0 9758 12.6 9758 12.3 Latn
Spanish 9759 -0.7 9574 8.9 9574 8.9 9574 10.5 9574 10.2 Latn
Venetian 9764 -0.6 9083 3.3 9083 3.4 9083 4.8 9083 4.5 Latn
Romansch (Sutsilvan) 9764 -0.6 9459 7.6 9459 7.6 9459 9.2 9459 8.9 Latn
Huastec (Sierra de Otontepec) 9778 -0.5 9430 7.2 9430 7.3 9430 8.8 9430 8.5 Latn
Comorian, Ngazidja 9783 -0.4 9783 11.2 9783 11.3 9783 12.9 9783 12.6 Latn
Lamnso' 9792 -0.4 7828 -11.0 7828 -10.9 7648 -11.7 7648 -12.0 Latn
Hawaiian 9812 -0.2 8588 -2.3 8588 -2.3 8588 -0.9 8588 -1.2 Latn
Romansch (Surmiran) 9827 0.0 9662 9.9 9662 9.9 9662 11.5 9662 11.2 Latn
German, Standard (1996) 9828 0.0 9696 10.3 9696 10.3 9696 11.9 9696 11.6 Latn
Mixe, Totontepec 9829 0.0 8351 -5.0 8351 -5.0 8351 -3.6 8351 -3.9 Latn
German, Standard (1901) 9830 0.0 9692 10.2 9692 10.3 9692 11.9 9692 11.6 Latn
Talysh 9836 0.1 8180 -7.0 8180 -6.9 8180 -5.6 8180 -5.8 Latn
Aceh 9845 0.2 9729 10.6 9729 10.7 9729 12.3 9729 12.0 Latn
Maltese 9846 0.2 9198 4.6 9198 4.7 9198 6.2 9198 5.9 Latn
Chin, Matu 9854 0.3 9840 11.9 9840 12.0 9840 13.6 9840 13.3 Latn
Asturian 9858 0.3 9636 9.6 9636 9.6 9636 11.2 9636 10.9 Latn
Gaelic, Scottish 9859 0.3 9646 9.7 9646 9.8 9646 11.3 9646 11.0 Latn
Chuukese 9878 0.5 9878 12.3 9878 12.4 9878 14.0 9878 13.7 Latn
Nyemba 9882 0.6 9881 12.4 9881 12.4 9881 14.0 9881 13.7 Latn
Amarakaeri 9917 0.9 9499 8.0 9499 8.1 9086 4.9 9086 4.6 Latn
Candoshi-Shapra 9918 0.9 9862 12.1 9862 12.2 9862 13.8 9862 13.5 Latn
Siona 9933 1.1 9161 4.2 9161 4.2 8826 1.9 8748 0.7 Latn
Dangme 9936 1.1 8796 0.0 8796 0.1 8779 1.3 8779 1.0 Latn
Shona 9943 1.2 9943 13.1 9943 13.1 9943 14.7 9943 14.4 Latn
Páez 9980 1.6 9869 12.2 9869 12.3 9869 13.9 9869 13.6 Latn
Romansch 10003 1.8 9866 12.2 9866 12.3 9866 13.9 9866 13.6 Latn
Pampangan 10005 1.8 10005 13.8 10005 13.8 10005 15.5 10005 15.2 Latn
Cebuano 10008 1.8 10008 13.8 10008 13.9 10008 15.5 10008 15.2 Latn
Tagalog 10013 1.9 10013 13.9 10013 13.9 10013 15.6 10013 15.3 Latn
Romagnolo 10029 2.1 9511 8.2 9511 8.2 9511 9.8 9511 9.5 Latn
French 10030 2.1 9598 9.1 9598 9.2 9598 10.8 9598 10.5 Latn
Sotho, Northern 10036 2.1 9771 11.1 9771 11.2 9771 12.8 9771 12.5 Latn
Indonesian 10059 2.4 10059 14.4 10059 14.5 10059 16.1 10059 15.8 Latn
Tswana 10067 2.4 10047 14.2 10047 14.3 10047 15.9 10047 15.6 Latn
Bugis 10070 2.5 10070 14.5 10070 14.6 10070 16.2 10070 15.9 Latn
Sunda 10071 2.5 10071 14.5 10071 14.6 10071 16.2 10071 15.9 Latn
Uzbek, Northern (Latin) 10088 2.7 9836 11.8 9836 11.9 9836 13.5 9836 13.2 Latn
Gaelic, Irish 10114 2.9 9591 9.1 9591 9.1 9591 10.7 9591 10.4 Latn
Hindustani, Sarnami 10116 2.9 9963 13.3 9963 13.4 9963 15.0 9963 14.7 Latn
Tzeltal, Oxchuc 10119 3.0 9780 11.2 9780 11.3 9780 12.9 9780 12.6 Latn
Turkmen (Latin) 10124 3.0 9185 4.4 9185 4.5 9185 6.0 9185 5.7 Latn
Dagaare, Southern 10141 3.2 9495 8.0 9495 8.0 9477 9.4 9477 9.1 Latn
Igbo 10151 3.3 8653 -1.6 8653 -1.5 8653 -0.1 8653 -0.4 Latn
Picard 10151 3.3 9175 4.3 9175 4.4 9175 5.9 9175 5.6 Latn
Micmac 10162 3.4 9234 5.0 9234 5.1 9234 6.6 9234 6.3 Latn
Uyghur (Latin) 10186 3.7 9999 13.7 9999 13.8 9999 15.4 9999 15.1 Latn
Malay (Latin) 10189 3.7 10188 15.9 10188 15.9 10188 17.6 10188 17.3 Latn
Azerbaijani, North (Latin) 10198 3.8 8717 -0.9 8717 -0.8 8717 0.6 8717 0.3 Latn
Japanese 10227 4.1 3437 -60.9 3437 -60.9 3437 -60.3 6832 -21.4 Jpan
Bislama 10233 4.1 10233 16.4 10233 16.4 10233 18.1 10233 17.8 Latn
Bali 10235 4.2 10235 16.4 10235 16.5 10235 18.1 10235 17.8 Latn
Occitan (Francoprovençal, Savoie) 10240 4.2 8665 -1.5 8665 -1.4 8665 0.0 8665 -0.3 Latn
Themne 10244 4.2 8323 -5.4 8323 -5.3 8323 -3.9 8323 -4.2 Latn
Karelian 10245 4.3 9874 12.3 9874 12.4 9761 12.6 9648 11.0 Latn
Dutch 10247 4.3 10246 16.5 10246 16.6 10246 18.2 10246 17.9 Latn
Bamun 10248 4.3 8744 -0.6 8744 -0.5 8744 0.9 8744 0.6 Latn
Edo 10262 4.4 10260 16.7 10260 16.8 10260 18.4 10260 18.1 Latn
Bicolano, Central 10263 4.4 10263 16.7 10263 16.8 10263 18.4 10263 18.1 Latn
Tsonga (Mozambique) 10274 4.5 10047 14.2 10047 14.3 10047 15.9 10047 15.6 Latn
Quechua, Ayacucho 10295 4.8 10273 16.8 10273 16.9 10273 18.6 10273 18.2 Latn
Mina 10300 4.8 9085 3.3 9085 3.4 9040 4.3 9040 4.1 Latn
Romanian (2006) 10303 4.8 9683 10.1 9683 10.2 9683 11.7 9683 11.5 Latn
Luxembourgeois 10306 4.9 9998 13.7 9998 13.8 9998 15.4 9998 15.1 Latn
Romanian (1993) 10311 4.9 9691 10.2 9691 10.3 9691 11.8 9691 11.5 Latn
Romanian (1953) 10317 5.0 9691 10.2 9691 10.3 9691 11.8 9691 11.5 Latn
Mozarabic 10317 5.0 10184 15.8 10184 15.9 10184 17.5 10184 17.2 Latn
Sardinian, Logudorese 10323 5.0 10195 15.9 10195 16.0 10195 17.7 10195 17.3 Latn
Haitian Creole French (Popular) 10339 5.2 10103 14.9 10103 15.0 10103 16.6 10103 16.3 Latn
Hiligaynon 10405 5.9 10405 18.3 10405 18.4 10405 20.1 10405 19.8 Latn
Shor 10414 6.0 5724 -34.9 5724 -34.9 5724 -33.9 5724 -34.1 Cyrl
Sango 10428 6.1 8644 -1.7 8644 -1.6 8644 -0.2 8644 -0.5 Latn
Ilocano 10429 6.1 10429 18.6 10429 18.7 10429 20.4 10429 20.0 Latn
Occitan (Francoprovençal, Fribourg) 10439 6.2 9226 4.9 9226 5.0 9226 6.5 9226 6.2 Latn
Niue 10444 6.3 10443 18.8 10443 18.8 10443 20.5 10443 20.2 Latn
Comorian, Maore 10458 6.4 10340 17.6 10340 17.7 10340 19.3 10340 19.0 Latn
Chin, Falam 10467 6.5 10467 19.0 10467 19.1 10467 20.8 10467 20.5 Latn
Ibibio 10468 6.5 10467 19.0 10467 19.1 10467 20.8 10467 20.5 Latn
Lingala (tones) 10476 6.6 8990 2.2 8990 2.3 8760 1.1 8760 0.8 Latn
Hebrew 10502 6.9 5822 -33.8 5822 -33.8 5822 -32.8 5822 -33.0 Hebr
Saxon, Low 105397.2 1031817.3 1031817.4 1031819.1 1031818.8 Latn
Venda 106208.1 1010614.9 1010615.0 1010616.6 1010616.3 Latn
Mòoré 106218.1 94277.2 94277.3 94278.8 94278.5 Latn
Quichua, Chimborazo Highland 106518.4 1054920.0 1054920.0 1043620.4 1032318.8 Latn
Saami, North 106548.4 994413.1 994413.2 994414.8 994414.5 Latn
Occitan (Francoprovençal, Valais) 106628.5 94137.0 94137.1 94138.6 94138.3 Latn
Walloon 107149.0 978511.3 978511.3 978512.9 978512.6 Latn
Hungarian 107189.1 978311.2 978311.3 978312.9 978312.6 Latn
Nzema 107409.3 94397.3 94397.4 94398.9 94398.6 Latn
Tsonga (Zimbabwe) 107589.5 1054619.9 1054620.0 1054621.7 1054621.4 Latn
Quechua, North Junín 107659.5 1075622.3 1075622.4 1075624.1 1075623.8 Latn
Hmong, Northern Qiandong 108019.9 1080122.8 1080122.9 1080124.7 1080124.3 Latn
Khasi 1081010.0 1060520.6 1060520.7 1060522.4 1060522.1 Latn
K'iche', Central 1081710.1 1081723.0 1081723.1 1081724.8 1081724.5 Latn
Javanese (Latin) 1086310.5 1086323.5 1086323.6 1086325.4 1086325.0 Latn
Occitan (Francoprovençal, Vaud) 1088510.8 975711.0 975711.0 975712.6 975712.3 Latn
Shuar 1093011.2 1053319.8 1053319.9 1053321.6 1053321.2 Latn
Baoulé 1094611.4 1020416.0 1020416.1 1020417.8 1020417.4 Latn
Totonac, Papantla 1095511.5 1095524.6 1095524.7 1095526.4 1095526.1 Latn
Evenki 1096211.5 5948-32.4 5948-32.3 5776-33.3 5776-33.5 Cyrl
Kabuverdianu 1097111.6 1033417.5 1033417.6 1033419.3 1032518.8 Latn
Jula 1103812.3 8719-0.9 8719-0.8 87190.6 87190.4 Latn
Éwé 1110713.0 996713.3 996713.4 995014.8 995014.5 Latn
Asháninka 1116713.6 1116427.0 1116427.0 1116428.8 1116428.5 Latn
Hmong Njua 1117913.8 1117927.1 1117927.2 1117929.0 1117928.7 Latn
Mbundu (009) 1120014.0 1113326.6 1113326.7 1113328.5 1113328.1 Latn
Arabic, Standard 1121414.1 6183-29.7 6183-29.6 6166-28.8 6166-29.0 Arab
Samoan 1123114.3 1123127.7 1123127.8 1123129.6 1123129.3 Latn
Quechua, Margos-Yarowilca-Lauricocha 1126014.6 1110826.3 1110826.4 1110828.2 1110827.9 Latn
Achuar-Shiwiar 1129915.0 1129628.5 1129628.5 1129630.4 1129630.0 Latn
Tojolabal 1146516.7 1017315.7 1017315.8 1017317.4 1017317.1 Latn
Bushi 1148716.9 1098024.9 1098024.9 1098026.7 1098026.4 Latn
Osetin 1152817.3 6370-27.6 6370-27.5 6370-26.5 6370-26.7 Cyrl
Tzotzil (Chamula) 1155817.6 1070321.7 1070321.8 1070323.5 1070323.2 Latn
Rarotongan 1156217.7 1152731.1 1152731.2 1152733.0 1152732.7 Latn
Maya, Yucatán 1173219.4 1067521.4 1067521.5 1067523.2 1067522.9 Latn
Quechua, Northern Conchucos Ancash 1178619.9 1178234.0 1178234.1 1178236.0 1178235.6 Latn
Yanomamö 1191321.2 1047019.1 1047019.1 1047020.8 1047020.5 Latn
Aguaruna 1191821.3 1185434.8 1185434.9 1185436.8 1185436.4 Latn
Hausa (Niger) 1207822.9 1183134.5 1183134.6 1183136.5 1183136.2 Latn
Hausa (Nigeria) 1207822.9 1186334.9 1186335.0 1186336.9 1186336.5 Latn
Vietnamese 1218224.0 88770.9 88771.0 88772.4 88772.2 Latn
Chin, Haka 1223124.5 1223139.1 1223139.2 1223141.2 1223140.8 Latn
Quechua, Ambo-Pasco 1232725.4 1218138.5 1218138.6 1218140.6 1218140.2 Latn
Cashibo-Cacataibo 1234925.7 1151430.9 1151431.0 1151432.9 1151432.5 Latn
Tem 1241826.4 88781.0 88781.0 8246-4.8 8246-5.1 Latn
Ojibwa, Northwestern 1241926.4 4775-45.7 4775-45.7 4775-44.9 4775-45.0 Cans
Pidgin, Nigerian 1242426.4 1242441.3 1242441.4 1242443.4 1242443.0 Latn
Tahitian 1244926.7 1224439.2 1224439.3 1224441.3 1224440.9 Latn
Amahuaca 1253027.5 1253042.5 1253042.6 1253044.6 1253044.2 Latn
Lobi 1264528.7 1043518.7 1043518.7 1043520.4 1043520.1 Latn
Cree, Swampy 1270529.3 4849-44.9 4849-44.8 4849-44.0 4849-44.2 Cans
Navajo 1283530.6 998113.5 998113.6 980313.1 980312.8 Latn
Quechua, South Bolivian 1292431.5 1290246.7 1290246.8 1290248.9 1290248.5 Latn
Kaqchikel, Central 1294331.7 1261643.5 1261643.6 1261645.6 1261645.2 Latn
Maori 1299432.2 1299347.7 1299347.8 1299349.9 1299349.6 Latn
Seraiki 1302032.5 7303-17.0 7303-16.9 7302-15.7 7302-16.0 Arab
Ticuna 1313733.7 1050819.5 1050819.6 988614.1 988613.8 Latn
Arabela 1325634.9 1325550.7 1325550.8 1325553.0 1325552.6 Latn
Swati 1337236.1 1332051.5 1332051.6 1332053.7 1332053.3 Latn
Komi-Permyak 1349937.4 7378-16.1 7378-16.0 7378-14.9 7378-15.1 Cyrl
Farsi, Western 1359738.4 7537-14.3 7537-14.2 7460-13.9 7460-14.1 Arab
Yukaghir, Northern 1361838.6 7366-16.2 7366-16.2 7366-15.0 7366-15.2 Cyrl
Dari 1366939.1 7607-13.5 7607-13.4 7561-12.7 7561-13.0 Arab
Pintupi-Luritja 1373639.8 1373656.2 1373656.3 1373658.5 1373658.1 Latn
Urdu 1385941.0 7768-11.7 7768-11.6 7733-10.8 7733-11.0 Arab
Panjabi, Western 1399642.4 7904-10.1 7904-10.1 7893-8.9 7893-9.2 Arab
Tongan 1401742.6 1245341.6 1245341.7 1245343.7 1245343.3 Latn
Yoruba 1405943.1 1023816.4 1023816.5 92767.1 92766.8 Latn
Inuktitut, Greenlandic 1406743.1 1406760.0 1406760.1 1406762.3 1406761.9 Latn
Serbian (Cyrillic) 1409043.4 7740-12.0 7740-11.9 7740-10.7 7740-10.9 Cyrl
Urdu (2) 1410843.6 7904-10.1 7904-10.1 7868-9.2 7868-9.4 Arab
Nanai 1414844.0 7666-12.8 7666-12.8 7636-11.9 7636-12.1 Cyrl
Caquinte 1425045.0 1424662.0 1424662.1 1424664.4 1424664.0 Latn
Tigrigna 1427045.2 5502-37.4 5502-37.4 5502-36.5 5502-36.7 Ethi
Bosnian (Cyrillic) 1440446.6 7906-10.1 7906-10.0 7906-8.8 7906-9.0 Cyrl
Malay (Arabic) 1441046.6 7899-10.2 7899-10.1 7899-8.8 7899-9.1 Arab
Konjo 1462048.8 1462066.2 1462066.4 1462068.7 1462068.3 Latn
Pashto, Northern 1472749.9 8276-5.9 8276-5.8 8274-4.5 8274-4.8 Arab
Bora 1493452.0 1181934.4 1181934.5 1165934.6 1165934.2 Latn
Quechua, Huamalíes-Dos de Mayo Huánuco 1497352.4 1477268.0 1477268.1 1477270.5 1477270.0 Latn
Toba 1525055.2 1467266.8 1467267.0 1467269.3 1467268.9 Latn
Nahuatl, Central 1546057.3 1545775.8 1545775.9 1545778.4 1545777.9 Latn
Vai 1555558.3 6931-21.2 6931-21.1 6931-20.0 6931-20.2 Vaii
Tatar 1560158.8 8493-3.4 8493-3.4 8493-2.0 8493-2.2 Cyrl
Tajiki 1560658.8 8594-2.3 8594-2.2 8594-0.8 8594-1.1 Cyrl
Macedonian 1584361.2 8704-1.0 8704-1.0 87040.5 87040.2 Cyrl
Ukrainian 1610963.9 8785-0.1 8785-0.0 87851.4 87851.1 Cyrl
Azerbaijani, North (Cyrillic) 1611764.0 8733-0.7 8733-0.6 87330.8 87330.5 Cyrl
Orok 1611864.0 8696-1.1 8696-1.0 8251-4.8 8251-5.0 Cyrl
Amharic 1614464.3 5382-38.8 5382-38.8 5382-37.9 5382-38.1 Ethi
Kazakh 1627365.6 8791-0.0 87910.0 87911.5 87911.2 Cyrl
Mongolian, Halh (Cyrillic) 1629565.8 88370.5 88370.6 88372.0 88371.7 Cyrl
Tamazight, Standard Morocan 1630165.9 6371-27.6 6371-27.5 6371-26.5 6371-26.7 Tfng
Turkmen (Cyrillic) 1643867.3 88260.4 88260.4 88261.9 88261.6 Cyrl
Altai, Southern 1650868.0 88650.8 88650.9 88652.3 88652.0 Cyrl
Shipibo-Conibo 1667469.7 1639186.4 1639186.5 1639189.2 1639188.7 Latn
Bulgarian 1684471.4 92284.9 92285.0 92286.5 92286.2 Cyrl
Armenian 1685371.5 90382.8 90382.8 90384.3 90384.0 Armn
Chachi 1704273.4 1691292.3 1691292.4 1691195.2 1691094.6 Latn
Belarusan 1711774.2 93075.8 93075.9 93077.4 93077.1 Cyrl
Tai Dam 1730176.1 7181-18.3 7181-18.3 6423-25.9 6423-26.1 Tavt
Abkhaz 1731876.2 92805.5 92805.6 92807.1 92806.8 Cyrl
Yaneshaʼ 1733676.4 1585180.2 1585180.4 1523875.9 1523875.4 Latn
Uzbek, Northern (Cyrillic) 1739477.0 93646.5 93646.6 93648.1 93647.8 Cyrl
Adyghe 1743277.4 94837.8 94837.9 94839.4 94839.2 Cyrl
Kirghiz 1749078.0 93906.8 93906.9 93908.4 93908.1 Cyrl
Nganasan 1752778.4 93366.2 93366.2 93367.7 93367.5 Cyrl
Yiddish, Eastern 1758979.0 95939.1 95939.2 8621-0.5 8621-0.8 Hebr
Yakut 1761579.3 94707.7 94707.8 94709.3 94709.0 Cyrl
Khakas 1761679.3 95548.6 95548.7 955410.3 955410.0 Cyrl
Tuva 1771780.3 95728.8 95728.9 957210.5 957210.2 Cyrl
Russian 1775080.6 96059.2 96059.3 960510.8 960510.6 Cyrl
Matsés 1778881.0 1733697.1 1733697.3 17336100.1 1733699.5 Latn
Kabardian 1787981.9 96339.5 96339.6 963311.2 963310.9 Cyrl
Inuktitut, Eastern Canadian 1791082.3 6456-26.6 6456-26.5 6456-25.5 6456-25.7 Cans
Magahi 1792082.4 6950-21.0 6950-20.9 5090-41.3 6052-30.3 Deva
Uyghur (Arabic) 1832386.5 982611.7 982611.8 982613.4 982613.1 Arab
Greek (monotonic) 1832486.5 1001713.9 1001714.0 1001715.6 1001715.3 Grek
Cherokee (cased) 1875990.9 7245-17.6 7245-17.6 7245-16.4 7245-16.6 Cher
Cherokee (uppercase) 1875990.9 7245-17.6 7245-17.6 7245-16.4 7245-16.6 Cher
Bhojpuri 1893092.6 7294-17.1 7294-17.0 5217-39.8 6274-27.8 Deva
Greek (polytonic) 1955599.0 1003914.2 1003914.2 1003915.9 1003915.6 Grek
Maithili 20047104.0 7500-14.7 7500-14.7 5382-37.9 6435-25.9 Deva
Nepali 20816111.8 7720-12.2 7720-12.2 5338-38.4 6615-23.9 Deva
Bengali 21349117.2 7871-10.5 7871-10.4 5318-38.6 7061-18.7 Beng
Thai (2) 21694120.8 7390-16.0 7390-15.9 5896-32.0 5950-31.5 Thai
Thai 21873122.6 7479-15.0 7479-14.9 5992-30.8 6043-30.4 Thai
Gujarati 21890122.8 8184-6.9 8184-6.9 5586-35.5 6978-19.7 Gujr
Ashéninka, Pichis 22298126.9 22163152.0 22163152.2 22163155.8 22163155.1 Latn
Panjabi, Eastern 22584129.8 8788-0.1 87880.0 6181-28.7 7470-14.0 Guru
Sanskrit 22717131.2 8171-7.1 8171-7.0 5186-40.2 6544-24.7 Deva
Sinhala 22785131.9 8519-3.1 8519-3.1 6061-30.1 6853-21.1 Sinh
Khün 23411138.2 8047-8.5 8047-8.4 4655-46.3 5140-40.8 Lana
Kannada 23429138.4 8463-3.8 8463-3.7 5580-35.6 6989-19.6 Knda
Hindi 23466138.8 89621.9 89622.0 6187-28.6 7632-12.2 Deva
Lao 24128145.5 8340-5.2 8340-5.1 6365-26.5 6447-25.8 Laoo
Telugu 24993154.3 91454.0 91454.1 6027-30.4 7156-17.6 Telu
Khmer, Central 25053154.9 8619-2.0 8619-1.9 5511-36.4 6791-21.8 Khmr
Malayalam 25115155.6 89071.3 89071.4 5286-39.0 6762-22.2 Mlym
Marathi 25231156.8 93456.3 93456.3 6241-28.0 7939-8.6 Deva
Javanese (Javanese) 26155166.2 8741-0.6 8741-0.5 5207-39.9 6786-21.9 Java
Georgian 26534170.0 974210.8 974210.9 974212.4 974212.1 Geor
Chakma 27301177.8 1423161.8 7696-12.4 4883-43.6 5313-38.8 Cakm
Pular (Adlam) 28460189.6 1495170.0 8233-6.3 7435-14.2 7435-14.4 Adlm
Maldivian 28469189.7 1503070.9 1503071.0 8449-2.5 8449-2.8 Thaa
Dzongkha 28504190.1 96509.7 96509.8 7620-12.1 7620-12.3 Tibt
Mon 28674191.8 1001613.9 1001614.0 5751-33.6 6233-28.3 Mymr
Sanskrit (Grantha) 29914204.4 1541875.3 8241-6.2 5244-39.5 8173-5.9 Gran
Tamil 30208207.4 1082423.1 1082423.2 6894-20.4 92736.7 Taml
Tamil (Sri Lanka) 30213207.4 1082523.1 1082523.2 6893-20.5 92756.8 Taml
Tibetan, Central 30411209.5 1024316.5 1024316.6 7958-8.2 7958-8.4 Tibt
Burmese 35846264.8 1257243.0 1257243.1 7695-11.2 8630-0.7 Mymr
Shan 36130267.7 1255042.7 1255042.8 8327-3.9 8604-1.0 Mymr
Min 4170 -57.6 2202 -75.0 2202 -74.9 2202 -74.6 4007 -53.9
Median 9827 8794 8788 8665 8688
Mean 11315 15.1 8833 0.4 8787 -0.0 8567 -1.1 8700 0.1
Max (ignoring outlier) 35846 264.8 17336 97.1 17336 97.3 17336 100.1 17336 99.5
Max 36130 267.7 22163 152.0 22163 152.2 22163 155.8 22163 155.1
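
The percentage next to each count in the table appears to be that count's difference from the column's median, as given in the Median row. Here is a minimal sketch of that arithmetic in Python, checked against three rows of the first numeric column. The function name `delta_from_median` and the constant name are made up for illustration, and rounding to one decimal place is an assumption inferred from how the values are printed.

```python
def delta_from_median(count: int, median: int) -> float:
    """Percentage difference of count from median, rounded to one decimal."""
    return round((count / median - 1) * 100, 1)

# Median of the first numeric column, copied from the Median row.
MEDIAN_FIRST_COLUMN = 9827

# Counts copied from the first numeric column of three table rows.
samples = {"Asturian": 9858, "Japanese": 10227, "Shan": 36130}

for name, count in samples.items():
    print(name, delta_from_median(count, MEDIAN_FIRST_COLUMN))
```

For these three rows the computed values match the percentages listed in the table (0.3, 4.1 and 267.7).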