It’s Not Wrong that `"🤦🏼‍♂️".length == 7`

But It’s Better that `"🤦🏼‍♂️".len() == 17` and Rather Useless that `len("🤦🏼‍♂️") == 5`

From time to time, someone shows that in JavaScript the .length of a string containing an emoji results in a number greater than 1 (typically 2) and then proceeds to the conclusion that haha JavaScript is so broken—and is rewarded with many likes. In this post, I will try to convince you that ridiculing JavaScript for this is less insightful than it first appears and that Swift’s approach to string length isn’t unambiguously the best one. Python 3’s approach is unambiguously the worst one, though.

What’s Going on with the Title?

"🤦🏼‍♂️".length == 7 evaluates to true as JavaScript. Let’s try JavaScript console in Firefox:

"🤦🏼‍♂️".length == 7
true

Haha, right? Well, you’ve been told that the Python community suffered the Python 2 vs. Python 3 split, among other things, to Get Unicode Right. Let’s try Python 3:

$ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("🤦🏼‍♂️") == 5
True
>>>

OK, then. Now, Rust has the benefit of learning from languages that came before it. Let’s try Rust:

$ cargo new -q length
$ cd length
$ echo 'fn main() { println!("{}", "🤦🏼‍♂️".len() == 17); }' > src/main.rs
$ cargo run -q
true

That’s better!

What?

The string contains a single emoji consisting of five Unicode scalar values:

Unicode scalar	UTF-32 code units	UTF-16 code units	UTF-8 code units	UTF-32 bytes	UTF-16 bytes	UTF-8 bytes
U+1F926 FACE PALM	1	2	4	4	4	4
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3	1	2	4	4	4	4
U+200D ZERO WIDTH JOINER	1	1	3	4	2	3
U+2642 MALE SIGN	1	1	3	4	2	3
U+FE0F VARIATION SELECTOR-16	1	1	3	4	2	3
Total	5	7	17	20	14	17
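The rows of the table can be reproduced with nothing but the Rust standard library. This is a minimal sketch: the function names `code_unit_rows` and `code_unit_totals` are made up for illustration, and the string is written with escapes so that the five scalar values are visible.

```rust
/// One table row per scalar value:
/// (scalar value, UTF-32 units, UTF-16 units, UTF-8 units).
fn code_unit_rows(s: &str) -> Vec<(u32, usize, usize, usize)> {
    s.chars()
        // A scalar value is always exactly one UTF-32 code unit.
        .map(|c| (c as u32, 1, c.len_utf16(), c.len_utf8()))
        .collect()
}

/// Column totals: (UTF-32 units, UTF-16 units, UTF-8 units).
fn code_unit_totals(s: &str) -> (usize, usize, usize) {
    code_unit_rows(s)
        .iter()
        .fold((0, 0, 0), |(t32, t16, t8), &(_, n32, n16, n8)| {
            (t32 + n32, t16 + n16, t8 + n8)
        })
}
```

Calling `code_unit_totals("\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}")` gives `(5, 7, 17)`. The byte columns follow by multiplying by the unit size: 5 × 4 = 20 bytes of UTF-32, 7 × 2 = 14 bytes of UTF-16, and 17 × 1 = 17 bytes of UTF-8.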

The string that contains one graphical unit consists of 5 Unicode scalar values. First, there’s a base character that means a person face palming. By default, the person would have a cartoonish yellow color. The next character is an emoji skintone modifier that changes the color of the person’s skin (and, in practice, also the color of the person’s hair). By default, the gender of the person is undefined, and e.g. Apple defaults to what they consider a male appearance and e.g. Google defaults to what they consider a female appearance. The next two scalar values pick a male-typical appearance specifically regardless of font and vendor. Instead of being an emoji-specific modifier like the skin tone, the gender specification uses an emoji-predating gender symbol (MALE SIGN) explicitly ligated using the ZERO WIDTH JOINER with the (skin-toned) face-palming person. (Whether it is a good or a bad idea that the skin tone and gender specifications use different mechanisms is out of the scope of this post.) Finally, VARIATION SELECTOR-16 makes it explicit that we want a multicolor emoji rendering instead of a monochrome dingbat rendering.

Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings store Unicode code points each of which is stored as one code unit by CPython 3, so the string occupies 5 code units. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. We’ll come back to the actual storage as opposed to semantics later.

Note about Python 3 added on 2019-09-09: Originally this article claimed that Python 3 guaranteed UTF-32 validity. This was in error. Python 3 guarantees that the units of the string stay within the Unicode code point range but does not guarantee the absence of surrogates. It not only allows unpaired surrogates, which might be explained by wishing to be compatible with the value space of potentially-invalid UTF-16, but Python 3 allows materializing even surrogate pairs, which is a truly bizarre design. The previous conclusions stand with the added conclusion that Python 3 is even more messed up than I thought! With the way the example string was constructed in Python 3, the Python 3 string happens to match the valid UTF-32 representation of the string, so it is still illustrative of UTF-32, but the rest of the article has been slightly edited to avoid claiming that Python 3 used UTF-32.

But I Want the Length to Be 1!

There’s a language for that. The following used Swift 4.2.3, which was the latest release when I was researching this, on Ubuntu 18.04:

$ mkdir swiftlen
$ cd swiftlen/
$ swift package init -q --type executable
$ swift package init --type executable
Creating executable package: swiftlen
Creating Package.swift
Creating README.md
Creating .gitignore
Creating Sources/
Creating Sources/swiftlen/main.swift
Creating Tests/
Creating Tests/LinuxMain.swift
Creating Tests/swiftlenTests/
Creating Tests/swiftlenTests/swiftlenTests.swift
Creating Tests/swiftlenTests/XCTestManifests.swift
$ echo 'print("🤦🏼‍♂️".count == 1)' > Sources/swiftlen/main.swift 
$ swift run swiftlen 2>/dev/null
true

(Not using the Swift REPL for the example, because it does not appear to accept non-ASCII input on Ubuntu! Swift 5.0.3 prints the same and the REPL is still broken.)

OK, so we’ve found a language that thinks the string contains one countable unit. But what is that countable unit? It’s an extended grapheme cluster. (“Extended” to distinguish from the older attempt at defining grapheme clusters now called legacy grapheme clusters.) The definition is in Unicode Standard Annex #29 (UAX #29).

The Lengths Seen So Far

We’ve seen four different lengths so far:

Number of UTF-8 code units (17 in this case)
Number of UTF-16 code units (7 in this case)
Number of UTF-32 code units or Unicode scalar values (5 in this case)
Number of extended grapheme clusters (1 in this case)

Given a valid Unicode string and a version of Unicode, all of the above are well-defined and it holds that each item higher on the list is greater than or equal to the items lower on the list.

One of these is not like the others, though: The first three numbers have an unchanging definition for any valid Unicode string whether it contains currently assigned scalar values or whether it is from the future and contains unassigned scalar values as far as software written today is aware. Also, computing the first three lengths does not involve lookups from the Unicode database. However, the last item depends on the Unicode version and involves lookups from the Unicode database. If a string contains scalar values that are unassigned as far as the copy of the Unicode database that the program is using is aware, the program will potentially overcount extended grapheme clusters in the string compared to a program whose copy of the Unicode database is newer and has assignments for those scalar values (and some of those assignments turn out to be combining characters).
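To make the “no lookups from the Unicode database” point concrete, here is a minimal sketch that derives all of the first three lengths from the bit patterns of UTF-8 lead bytes alone. The function name `lengths_from_utf8` is made up for illustration; the input is a `&str`, so it is assumed to be valid UTF-8.

```rust
/// Returns (UTF-8 code units, UTF-16 code units, scalar values), looking
/// only at the bytes of the UTF-8 representation and at no Unicode data.
fn lengths_from_utf8(s: &str) -> (usize, usize, usize) {
    let bytes = s.as_bytes();
    let mut utf16 = 0;
    let mut scalars = 0;
    for &b in bytes {
        // Continuation bytes have the bit pattern 10xxxxxx; they do not
        // start a new scalar value, so skip them.
        if (b & 0xC0) == 0x80 {
            continue;
        }
        scalars += 1;
        // A lead byte of the form 11110xxx starts a four-byte sequence,
        // i.e. a scalar value above U+FFFF, which needs a surrogate pair
        // (two code units) in UTF-16. Everything else needs one unit.
        utf16 += if b >= 0xF0 { 2 } else { 1 };
    }
    (bytes.len(), utf16, scalars)
}
```

For the example string this yields `(17, 7, 5)`, and the result would be the same for a string full of scalar values that are unassigned today. No comparably simple function exists for the extended grapheme cluster count, because that one needs per-character properties from a particular Unicode version.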

More Than One Length per Programming Language

It is not the case that a given programming language has to choose only one of the above. If we run this Swift program:

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)

it prints:

1
5
7
17

Let’s try Rust with unicode-segmentation = "1.3.0" in Cargo.toml:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
	let s = "🤦🏼‍♂️";
	println!("{}", s.graphemes(true).count());
	println!("{}", s.chars().count());
	println!("{}", s.encode_utf16().count());
	println!("{}", s.len());
}

The above program prints:

2
5
7
17

That’s unexpected! It turns out that unicode-segmentation does not implement the latest version of the Unicode segmentation rules, so it gives the ZERO WIDTH JOINER generic treatment (break right after ZWJ) instead of the newer refinement in the emoji context.

Let’s try again, but this time with unic-segment = "0.9.0" in Cargo.toml:

use unic_segment::Graphemes;

fn main() {
	let s = "🤦🏼‍♂️";
	println!("{}", Graphemes::new(s).count());
	println!("{}", s.chars().count());
	println!("{}", s.encode_utf16().count());
	println!("{}", s.len());
}

The above program prints:

1
5
7
17

In the Rust case, strings (here mere string slices) know the number of UTF-8 code units they contain. The len() method call just returns this number that has been stored since the creation of the string (in this case, compile time). In the other cases, what happens is the creation of an iterator and then instead of actually examining the values (string slices corresponding to extended grapheme clusters, Unicode scalar values or UTF-16 code units) that the iterator would yield, the count() method just consumes the iterator and returns the number of items that were yielded by the iteration. The count isn’t stored anywhere on the string (slice) afterwards. If we wanted to later know the counts again, we’d have to iterate over the string again.

Know in Advance or Compute When Needed?

This introduces a notable question in the design space: Should a given type of length quantity be eagerly computed when the string is created? Or should the length be computed when someone asks for it? Or should it be computed when someone asks for it and then automatically stored on the string object so that it’s available immediately if someone asks for it again?

The answer Rust has is that the length in the code units of the Unicode Encoding Form of the language is stored upon string creation, and the rest are computed when someone asks for them (and then forgotten and not stored on the string).

Swift is a higher-level language and doesn’t document the exact nature of its string internals as part of the API contract. In fact, the internal representation of Swift strings changed substantially between Swift 4.2 and Swift 5.0. It’s not documented if different views to the string are held onto once created, for example. The documentation does say that strings are copy-on-write, so the first mutation may involve copying the string’s storage.

Notably, the design space includes not remembering anything. The C programming language is a prominent example of this case. C strings don’t even remember their number of code units. To find out the number of code units, you have to iterate over the string until a sentinel value. In the case of C, the sentinel is the code unit for U+0000, so it excludes one Unicode scalar value from the possible string contents. However, that’s not a strictly necessary property of a sentinel-based design that doesn’t remember any lengths. 0xFF does not occur as a code unit in any valid UTF-8 string and 0xFFFFFFFF does not occur in any valid UTF-32 string, so they could be used as sentinels for UTF-8 and UTF-32 storage, respectively, without excluding a scalar value from the Unicode value space. There is no 16-bit value that never occurs in a valid UTF-16 string. However, a valid UTF-16 string does not contain unpaired surrogates, so an unpaired low surrogate could, in principle, be used as a sentinel in a design that wanted to use guaranteed-valid UTF-16 strings that don’t remember their code unit length.
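The 0xFF idea can be sketched as follows. This is a hypothetical storage format made up for illustration (as are the names `SENTINEL`, `to_sentinel_terminated`, and `sentinel_len`), not something an existing language is known to use.

```rust
/// Hypothetical sentinel for UTF-8 storage: the byte 0xFF never occurs
/// in valid UTF-8, so no scalar value is excluded from string contents.
const SENTINEL: u8 = 0xFF;

/// Copies the UTF-8 bytes of `s` and appends the sentinel.
fn to_sentinel_terminated(s: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(s.len() + 1);
    buf.extend_from_slice(s.as_bytes());
    buf.push(SENTINEL);
    buf
}

/// Like C's strlen: since no length is remembered, finding it means
/// scanning in O(n) until the sentinel. Returns None if there is none.
fn sentinel_len(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&b| b == SENTINEL)
}
```

Unlike with a C string, `sentinel_len(&to_sentinel_terminated("a\u{0}b"))` is `Some(3)`: U+0000 is ordinary content here, because the sentinel lives outside the Unicode value space.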

Knowing the Storage-Native Code Unit Length is Extremely Reasonable

The length of the string as counted in code units of its storage-native Unicode Encoding Form (i.e. whichever of UTF-8, UTF-16, and UTF-32 the programming language has chosen for its string semantics) is not like the other lengths. It is the length that the implementation cannot avoid having to know at the time of creating a new string, because it is the length that is required to be known in order to be able to allocate storage for a string. Even C, which promptly forgets about the code unit length in the storage-native Unicode Encoding Form after the string has been created, has to know this length when allocating storage for a new string.

That is, the design decision is about whether to remember this length. It is not about whether to compute it eagerly. You just have to have it at string creation time—i.e. eagerly.

Considering that remembering this quantity makes string concatenation, which is a common operation, substantially faster to implement compared to not remembering this quantity, remembering this quantity is fundamentally reasonable. Also, it means that you don’t need to maintain a sentinel value, which means that a substring operation can yield results that share the buffer with the original string instead of having to copy in order to be able to insert a sentinel. (Note that you can easily foil this benefit if you wish to eagerly maintain zero-termination for the sake of C string compatibility.)
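Both benefits can be seen in a small Rust sketch. The helper names `concat` and `substring` are made up for illustration; the standard library calls they wrap are real.

```rust
/// Concatenation with remembered lengths: one allocation of exactly the
/// right size and two bulk copies, without scanning for a terminator.
fn concat(a: &str, b: &str) -> String {
    let mut out = String::with_capacity(a.len() + b.len());
    out.push_str(a);
    out.push_str(b);
    out
}

/// A substring as a (pointer, length) pair into the original buffer.
/// Nothing is copied, because no sentinel has to be written after it.
/// Returns None if the range is out of bounds or would split a UTF-8
/// sequence in the middle.
fn substring(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

With `let s = concat("face", "palm");`, the slice `substring(&s, 4, 8)` is `Some("palm")` and its pointer is the address of byte 4 of `s`, i.e. the two share storage.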

What About Knowing the Other Lengths?

Even if we’ve established that it makes sense for a string implementation to remember the storage length of the string in code units of the storage-native Unicode encoding form, it doesn’t answer whether a string implementation should also remember other lengths or which kind of length should be offered in the most ergonomic API. (As we see above, Swift makes the number of extended grapheme clusters more ergonomic to obtain than the code unit or scalar value length.)

Also, if any other length is to be remembered, there is the question of whether it should be eagerly computed at string creation time or lazily computed the first time someone asks for it. It is easy to see why at least the latter does not make sense for a multi-threaded systems-programming language like Rust. If some properties of an object are lazily initialized, in a multi-threaded case you also need to solve synchronization of these computations. Furthermore, you need to allocate space at least for a pointer to auxiliary information if you want to be able to add auxiliary information later or you need to have a hashtable of auxiliary information where the string the information is about is the key, so auxiliary information, even when not present, has storage implications or implications of having to have global state in a run-time system. Finally, for systems programming, it may be more desirable to know the time complexity of a given operation clearly even if it means “always O(n)” instead of “possibly O(n) but sometimes O(1)”. Even if the latter looks strictly better, it is less predictable.
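To illustrate what lazy caching would cost, here is a minimal sketch of a hypothetical string type that remembers its scalar value count after the first request. `CachingString` is made up for this example and is not how Rust strings work; `std::sync::OnceLock` is from the standard library (Rust 1.70 or later).

```rust
use std::sync::OnceLock;

/// Hypothetical string type that lazily caches its scalar value count.
struct CachingString {
    text: String,
    // Extra storage that every string pays for, even if nobody ever asks.
    scalar_count: OnceLock<usize>,
}

impl CachingString {
    fn new(text: &str) -> Self {
        CachingString {
            text: text.to_owned(),
            scalar_count: OnceLock::new(),
        }
    }

    /// O(n) on the first call and O(1) afterwards. OnceLock supplies the
    /// synchronization that is needed if several threads ask at once.
    fn scalar_count(&self) -> usize {
        *self.scalar_count.get_or_init(|| self.text.chars().count())
    }
}
```

All three costs from the paragraph above show up: the struct is larger than a plain `String`, the cache cell carries synchronization, and the caller can no longer tell from the call site whether `scalar_count()` is going to be O(n) or O(1).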

For a higher-level language, arguments from space requirements or synchronization issues might not be decisive. It’s more relevant to consider what a given length quantity is used for. This is often forgotten in Internet debates that revolve around what length is the most “correct” or “logical” one. So for the lengths that don’t map to the size of storage allocation, what are they good for?

It turns out that in the Firefox code base there are two places where someone wants to know the number of Unicode scalar values in a string that is not being stored as UTF-32 and attention is not paid to what the scalar values actually are. The IETF specification for Session Traversal Utilities for NAT (STUN) used for WebRTC has the curious property that it places length limits on certain protocol strings such that the limits are expressed as number of Unicode scalar values but the strings are transmitted in UTF-8. Firefox validates these limits. (The limit looks like an arbitrary power-of-two (128 scalar values). The spec has remarks about the possible resulting byte length, which was wrong according to the IETF UTF-8 RFC that was current and already nearly five years old at the time of publication of the STUN RFC. Specifically, the STUN RFC repeatedly says that 128 characters as UTF-8 may be as long as 763 bytes. To arrive at that number, you have to assume that a UTF-8 character can be up to six bytes long, as opposed to up to 4 bytes long as in the prevailing UTF-8 RFC and in the Unicode Standard, and that the last character of the 128 is a zero terminator and, therefore, known to take just one byte.) In this case, the reason for wishing to know a non-storage length is to impose a limit. The other case is reporting the column number for the source location of JavaScript errors.
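A limit of this kind, and the arithmetic behind the 763 figure, can be sketched like this. It is an illustration only and not the code Firefox uses; the function names and parameters are made up.

```rust
/// Checks a limit expressed in Unicode scalar values against text stored
/// as UTF-8. Iteration stops as soon as the limit has been exceeded, so
/// an oversized input is not scanned to the end.
fn within_scalar_limit(s: &str, max_scalars: usize) -> bool {
    s.chars().take(max_scalars.saturating_add(1)).count() <= max_scalars
}

/// Worst-case UTF-8 byte length of `n` characters (n >= 1) when the last
/// one is a one-byte zero terminator and every other character takes
/// `max_bytes_per_char` bytes.
fn worst_case_bytes(n: usize, max_bytes_per_char: usize) -> usize {
    (n - 1) * max_bytes_per_char + 1
}
```

`worst_case_bytes(128, 6)` is 127 × 6 + 1 = 763, the number in the STUN RFC. With the four-byte maximum of the prevailing UTF-8 definition, `worst_case_bytes(128, 4)` is 127 × 4 + 1 = 509.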

Length limits, which we’ll come back to, probably aren’t a frequent enough use case to justify making strings know a particular kind of length as opposed to such length being possible to compute when asked for. Neither are error messages.

Another use case for asking for a length is iterating by index and using the length as the loop termination condition 1990s Java style. Like this:

for (int i = 0; i < s.length(); i++) {
    // Do something with s.charAt(i)
}

In this case, it’s actually important for the length to be a precomputed number on the string object. This use case is coupled with the requirement that indexing into the string to find the nth unit corresponding to the count of units that the “length” represents should be a fast operation.

The above pattern is a lot less conclusive in terms of what lengths should be precomputed (and what the indexing unit should be) than it first appears. The above loop doesn’t do random access by index. It sequentially uses every index from zero up to, but not including, length. Indeed, especially when iterating over a string by Unicode scalar value, typically when you examine the contents of a string, you iterate over the string in order. Programming languages these days provide an iterator facility for this, and e.g. to iterate over a UTF-8 string by scalar value, the iterator does not need to know the number of scalar values up front. E.g. in Rust, you can do this in O(n) time despite string slices not knowing their number of Unicode scalar values:

for c in s.chars() {
    // Do something with c
}

(Note that char is an 8-bit code unit (possibly UTF-8 code unit) in C and C++, char is a UTF-16 code unit in Java, char is a Unicode scalar value in Rust, and Character is an extended grapheme cluster in Swift.)

A programming language together with its library ecosystem should provide iteration over a string by Unicode scalar value and by extended grapheme cluster, but it does not follow that strings would need to know the scalar value length or the extended grapheme cluster length up front. Unlike the code unit storage length, those quantities aren’t useful for accelerating operations like concatenation that don’t care about the exact content of the string.

Which Unicode Encoding Form Should a Programming Language Choose?

The observation that having strings know their code unit length in their storage-native Unicode encoding form is extremely reasonable does not answer how many bits wide the code units should be.

The usual way to approach this question is to argue that UTF-32 is the best, because it provides O(1) indexing by “character” in the sense of a character meaning a Unicode scalar value, or the argument focuses on whether UTF-8 is unfair to some languages relative to UTF-16. I think these are bad ways to approach this question.

First of all, the argument that the answer should be UTF-32 is bad on two counts. First, it assumes that random access by scalar value is important, but in practice it isn’t. It’s reasonable to want to have a capability to iterate over a string by scalar value, but random access by scalar value is in the YAGNI department. Second, arguments in favor of UTF-32 typically come at a point where the person making the argument has learned about surrogate pairs in UTF-16 but has not yet learned about extended grapheme clusters being even larger things that the user perceives as a unit. That is, if you escape the variable-width nature of UTF-16 to UTF-32, you pay by doubling the memory requirements and extended grapheme clusters are still variable-width.

I’ll come back to the length fairness issue later, but I think a different argument is much more relevant in practice for the choice of in-memory Unicode encoding form. The more relevant argument is this: Implementations that choose UTF-8 actually accept the UTF-8 storage requirements. When wider-unit semantics are chosen for a language that doesn’t provide raw memory access and, therefore, has the opportunity to tweak string storage, the implementations try to come up with ways to avoid actually paying the cost of the wider units in some situations.

JavaScript and Java strings have the semantics of potentially-invalid UTF-16. SpiderMonkey and V8 implement an optimization for omitting the leading zeros of each code unit in a string, i.e. storing the string as ISO-8859-1 (the actual ISO-8859-1, not the Web notion of “ISO-8859-1” as a label of windows-1252), when all code units in the string have zeros in the most-significant half. The HotSpot JVM also implements this optimization, though enabling it is optional. Swift 4.2 implements a slightly different variant of the same idea, where ASCII-only strings are stored as 8-bit units and everything else is stored as UTF-16. CPython since 3.3 makes the same idea three-level with code point semantics: Strings are stored with 32-bit code units if at least one code point has a non-zero bit above the low 16 bits. Else if a string has non-zero bits above the low 8 bits for at least one code point, the string is stored as 16-bit units. Otherwise, the string is stored as 8-bit units (Latin1).
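The three-level decision described for CPython can be sketched as a function of the largest code point in the string. This only illustrates the rule as stated above; it is not CPython’s code, and the name `storage_unit_bytes` is made up.

```rust
/// Bytes per code unit that a three-level scheme like the one described
/// above would choose for a string, based on its largest scalar value.
fn storage_unit_bytes(s: &str) -> usize {
    // An empty string has no code points; treat its maximum as zero.
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    if max > 0xFFFF {
        4 // some code point has a non-zero bit above the low 16 bits
    } else if max > 0xFF {
        2 // some code point has a non-zero bit above the low 8 bits
    } else {
        1 // everything fits in Latin1
    }
}
```

One smart quote (U+2019) moves an otherwise ASCII string from 1 to 2 bytes per unit, and a single emoji moves the whole string to 4 bytes per unit. The example string from the title needs 5 × 4 = 20 bytes under this scheme, compared to 17 bytes as UTF-8.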

I think the unwillingness of implementations of languages that have chosen UTF-16 or UTF-32 (or UTF-32-ish as in the case of Python 3) string semantics to actually use UTF-16 or UTF-32 storage when they can get away with not using actual UTF-16 or UTF-32 storage is the clearest indictment against UTF-16 or UTF-32 (and other wide-unit semantics like what Python 3 uses).

Languages that choose UTF-8, on the other hand, stick to actual UTF-8 for the purpose of storing Unicode scalar values. When languages that choose UTF-8 deviate from UTF-8, they do so in order to represent values that are not Unicode scalar values for compatibility with external constraints. Rust uses a representation called WTF-8 for file system paths on Windows. All UTF-8 strings are WTF-8 strings, but WTF-8 can also represent unpaired surrogates for compatibility with Windows file paths being sequences of 16-bit units that can contain unpaired surrogates. Perl 6 uses an internal representation called UTF-8 Clean-8 (or UTF8-C8), which represents strings that consist of Unicode scalar values in Unicode Normalization Form C the same way as UTF-8 but represents non-NFC content differently and can represent sequences of bytes that are not valid UTF-8.

UTF-8 is the only one of the Unicode encoding forms that is also a Unicode encoding scheme, and of the Unicode encoding schemes, UTF-8 has clearly won for interchange. (Unicode encoding forms are what you have in RAM, so UTF-16 consists of native-endian, two-byte-aligned 16-bit code units. Unicode encoding schemes are what can be used for byte-oriented interchange, so e.g. UTF-16LE consists of 8-bit code units every pair of which form a potentially-unaligned little-endian 16-bit number, which in turn may form a surrogate pair.) When UTF-8 is used as the in-RAM representation, input and output operations are less expensive than with UTF-16 or UTF-32. UTF-16 or UTF-32 in RAM requires conversion from UTF-8 when reading input and conversion to UTF-8 when writing output. A system that guarantees UTF-8 validity internally, such as Rust, needs only to validate UTF-8 upon reading input and no conversion is needed when writing output. (Go takes a garbage in, garbage out approach to UTF-8: input is not validated at input time and output is written without conversion. However, iteration by scalar value can yield REPLACEMENT CHARACTERs when iterating over invalid UTF-8. That is, the input step is less expensive than in Rust, but iterating by scalar value is marginally more expensive. The output step is less correct.)
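In Rust terms, “validate but don’t convert” looks roughly like the sketch below. The wrapper names `validate` and `decode_lossy` are made up; `std::str::from_utf8` and `String::from_utf8_lossy` are real standard library functions. The lossy variant is shown only to illustrate where REPLACEMENT CHARACTERs come from and is not a model of how Go is implemented.

```rust
use std::borrow::Cow;

/// Rust-style input handling: validate once, then borrow the very same
/// bytes as a string slice. Nothing is converted or copied.
fn validate(input: &[u8]) -> Option<&str> {
    std::str::from_utf8(input).ok()
}

/// Lossy decoding: each invalid sequence is replaced with U+FFFD
/// REPLACEMENT CHARACTER. Valid input is borrowed without copying.
fn decode_lossy(input: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(input)
}
```

For valid input, the pointer of the `&str` returned by `validate` equals the pointer of the input buffer. For the invalid input `[b'a', 0xFF, b'b']`, `validate` returns `None` and `decode_lossy` returns `"a\u{FFFD}b"`.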

Finally, in terms of nudging developers to write correct code, UTF-8 has the benefit of being blatantly variable-width, so even with languages such as English, Somali, and Swahili, as soon as you have a dash or a smart quote, the variable-width nature of UTF-8 shows up. In this context, extended grapheme clusters are just extending the variable-width nature. Meanwhile, UTF-16 allows programmers to get too far while pretending to be working with something where the units they need to care about are fixed-width. Reacting to surrogate pairs by wishing to use UTF-32 instead is a bad idea, because if you want to write correct software, you still need to deal with variable-width extended grapheme clusters.

The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing. The choice of UTF-16 is a matter of early-adopter legacy from the time when Unicode was expected to be capped to 16 bits of code space. Once UTF-16 has been committed to, not breaking compatibility with already-written programs is important and justifies the continued use of UTF-16, but if you aren’t bound by that legacy and are designing a new language, you should go with UTF-8. Occasionally even systems that appear to be bound by the UTF-16 legacy can break free. Even though Swift is committed to interoperability with Cocoa, which uses UTF-16 strings, Swift 5 switched to UTF-8 for Swift-native strings. Similarly, PyPy has gone UTF-8 despite Python 3 having code point semantics.

Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?

Even if we accept that the storage should be UTF-8 and that the string implementation should maintain knowledge of the string length in UTF-8 code units, if the blatant variable-widthness of UTF-8 is argued to be a nudge toward dealing with the variable-widthness of extended grapheme clusters, shouldn’t the Swift approach of making extended grapheme cluster access and count the view that takes the least ceremony to use be the thing that every language should do?

Swift is still too young to draw definitive conclusions from. It’s easy to believe that the Swift approach nudges programmers to write more extended grapheme cluster-correct code and that the design makes sense for a language meant primarily for UI programming on a largely evergreen platform (iOS). It isn’t clear, though, that the Swift approach is the best for everyone.

Earlier, I said that the example used “Swift 4.2.3 on Ubuntu 18.04”. The “18.04” part is important! Swift.org ships binaries for Ubuntu 14.04, 16.04, and 18.04. Running the program

var s = "🤦🏼‍♂️"
print(s.count)
print(s.unicodeScalars.count)
print(s.utf16.count)
print(s.utf8.count)

in Swift 4.2.3 on Ubuntu 14.04 prints:

3
5
7
17

So Swift 4.2.3 on Ubuntu 18.04 as well as the unic_segment 0.9.0 Rust crate counted one extended grapheme cluster, the unicode-segmentation 1.3.0 Rust crate counted two extended grapheme clusters, and the same version of Swift, 4.2.3, but on a different operating system version counted three extended grapheme clusters!

Swift 4 delegates Unicode segmentation to operating system-provided ICU, and “Long-Term Support” in the Ubuntu case means security patches but does not mean rolling forward the Unicode version that the system copy of ICU knows about. In the case of iOS, delegating to system ICU is probably OK and will not lead to too high a probability of the text being from the future from the point of view of the OS copy of ICU, since the iOS ecosystem stays exceptionally well up-to-date. However, delegating to system ICU is not such a great match for the idea of using Swift on the server side if the server side means running an old LTS distro.

(Swift 5 appears to no longer use system ICU for this. That is, Swift 5.0.3 on Ubuntu 14.04 sees one extended grapheme cluster in the string. I haven’t investigated what Swift 5 uses, but I assume that the switch to UTF-8 string representation necessitated using something other than ICU, which is heavily UTF-16-oriented. However, the result with Swift 4.2.3 nicely illustrates the issue related to using extended grapheme clusters.)

If you are doing things that have to be extended grapheme cluster-aware, there just is no way around the issue of not being able to correctly segment text that comes from the future relative to the Unicode segmentation implementation that your program is using. This is not a reason to avoid extended grapheme clusters for tasks that require awareness of extended grapheme clusters.

However, pushing extended grapheme clusters onto tasks that do not really require the use of extended grapheme clusters introduces failure modes arising from the Unicode version dependency where such a dependency isn’t strictly necessary. For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates.

Let’s consider other languages a bit.

C++ is often deployed such that the application developer doesn’t ship the standard library with the program. Most obviously, relying on GNU libstdc++ provided by an LTS Linux distribution presents problems similar to those of Swift 4 relying on ICU provided by an LTS Linux distribution. This isn’t a Linux-specific issue. Old supported branches of Windows generally don’t get new system-level Unicode data, either. Even though there is some movement towards individual applications shipping their own copy of LLVM libc++ with the application and the increased pace of C++ standard development starting with C++11 has made using a system-provided C++ standard library more problematic even ignoring Unicode considerations, it doesn’t seem like a good idea for C++ to develop a tight coupling with extended grapheme clusters for operations that don’t strictly necessitate it as long as stuck-in-the-past system libraries (whether the C++ standard library itself or another library that it delegates to) are a significant part of the C++ standard library distribution practice.

There’s a proposal to expose extended grapheme cluster segmentation to JavaScript programs. The main problem with this proposal is the implication on APK sizes on Android and the effect of APK sizes on browser competition on Android. But if we ignore that for the moment and imagine this was part of the Web Platform, it would still be problematic to build this dependency into operations for which working on extended grapheme clusters isn’t strictly necessary. While the most popular browsers are evergreen, there’s still a long tail of browser instances that aren’t on the latest engine versions. When JavaScript executes on such browsers, there’d be effects similar to running Swift 4 on Ubuntu 14.04.

In contrast to C++ or JavaScript, the current Rust approach is to statically link all Rust library code, including the standard library, into the executable program. This means that the application distributor is in control of library versions and doesn’t need to worry about the program executing in the context of out-of-date Rust libraries. The flip side is concerns about the size of the executable. People already (rightly or wrongly) complain about the sizes of Rust executables. Pulling in a lot of Unicode data due to baking extended grapheme cluster processing into programs whose problem domain doesn’t strictly require working with extended grapheme clusters would be problematic in embedded contexts where the executable size is a real problem and not just a perceived problem—and would obviously make the perceived problem worse, too. Furthermore, in order to avoid problems similar to those involved in relying on system libraries, baking tight coupling with Unicode data into the standard library necessitates the organizational capability of keeping up with new Unicode versions in this area where not only data in the tables keeps changing but the format of the tables and, therefore, the associated algorithms have still been changing recently. Right now of the two extended grapheme cluster crates outside the Rust standard library, the one that’s organizationally closer to the standard library is the one that’s out of date.

Why Do We Want to Know Anyway?

“String length is about as meaningful a measurement as string height” – @qntm

Being able to allocate memory for strings gives a legitimate use case for knowing the storage length. However, in cases of Unicode scalar values or extended grapheme clusters, you typically want to iterate over them and look at each one instead of just knowing the count. So why do people want to know the count? As far as I can tell, there are two broad categories: Placing a quota limit that is fuzzy enough that it doesn’t need to be strictly tied to storage and trying to estimate how much text fits for display. Let’s look at the issue of estimating how much display space text takes, because it involves introducing yet another measurement of string length.

Display Space

Simply looking at the Latin letters i and m should make it clear that the display size of a string depends on the font and on the specific characters in the string. From this observation, the whole notion of estimating display space by counting characters seems folly. Indeed, if you want to know exactly how much text fits into a given space, you need to run a typesetting algorithm with a specific font, which may have a complex relationship between scalar values and glyphs, to actually see where the overflow starts. Yet, even in the case of the Latin script that has letters such as i and m, e.g. magazine editors can find character counts useful enough for estimating how many print pages an article of a given character count length is going to fill.

As for computer user interfaces, character terminal user interfaces use a monospaced font where both i and m take up one character cell on a grid. In the context of a monospaced font, the extended grapheme cluster count of Latin-script text corresponds directly to the display space taken. The same obviously applies to the Greek and Cyrillic scripts, which are so close to the Latin script that fonts even intend to reuse glyphs across these scripts. In contrast, CJK ideographs, Japanese kana, and Hangul syllables take two cells of a terminal grid. From the CJK perspective, these are full-width characters and the ASCII characters are half-width characters. There also exist half-width katakana characters, which fit into an 8-bit encoding with ASCII and take one cell on the terminal grid and, therefore, are technically easier to fit to Latin script-oriented terminal systems. The display width on a terminal also corresponds to the byte count in the legacy CJK encodings: ASCII takes one byte, and a CJK ideograph, a full-width kana, or a Hangul syllable takes two bytes. In the case of Shift_JIS, half-width katakana takes one byte per character.

This brings us to the concept of East Asian Width. ASCII and half-width katakana characters are narrow. CJK ideographs, full-width kana, and Hangul syllables are wide. However, even in the worldview that is split to Latin, Greek, and Cyrillic on one hand and Chinese, Japanese, and Korean on the other hand, there are ambiguities. From the perspective of European legacy encodings, Greek and Cyrillic (as well as accented Latin) characters are as wide as ASCII. However, in legacy CJK encodings, Greek and Cyrillic characters take two bytes. This means that in terms of East Asian Width, a string can have a general-purpose width, which resolves these ambiguous characters as narrow, or a legacy CJK-context width, which resolves these ambiguous characters as wide.

So is the general-purpose variant (that resolves Greek and Cyrillic characters as narrow) of East Asian Width the one true string length measure? Well, no.

First of all, the concept ignores all scripts that are geographically and in Unicode order between Latin, Greek, and Cyrillic on one hand and CJK on the other (even though some other scripts that are structurally similar to the Latin, Greek, and Cyrillic scripts and make sense for a monospaced font, such as the Armenian and Georgian scripts, fit this concept, too, despite not having a history in pre-Unicode CJK context). As it happens, though, emoji do fit into the concept, except for weird errors in the Unicode database. After all, emoji originate from Japan and were two bytes each when represented using the private use area of Shift_JIS.

Second, the concept assumes that there is one-to-one correspondence between scalar values and extended grapheme clusters. If we run this Rust program:

use unicode_width::UnicodeWidthStr;

fn main() {
    println!("{}", "🤦🏼‍♂️".width());
}

it prints:

5

This is because the base emoji is wide (2), the combining skin tone modifier is also wide (2), the male sign is counted as narrow (1), and the zero-width joiner and the variation selector are treated as control characters that don’t count towards width. Obviously, this is not the answer that we want. The answer we want is 2. Ideas that come to mind immediately, such as only counting the width of the first character in an extended grapheme cluster or taking the width of the widest character in an extended grapheme cluster, don’t work, because flag emoji consist of two regional indicator symbol letter characters both of which have East Asian Width of Neutral (i.e. they are counted as narrow but are not marked as narrow, because they are considered to exist outside the domain of East Asian typography). I’m not aware of any official Unicode definition that would reliably return 2 as the width of every kind of emoji. 😭
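
The arithmetic above, and the failure of the two "obvious" fixes on flag emoji, can be sketched with the standard library alone. This is only an illustration: `toy_width` is a hypothetical helper that hard-codes the widths named in the paragraph above for a handful of scalar values. It is not a real East Asian Width table and not how the `unicode_width` crate is implemented.

```rust
/// Toy per-scalar width lookup that mirrors the widths named in the text.
/// It is NOT a real East Asian Width table (hypothetical helper).
fn toy_width(c: char) -> usize {
    match c {
        '\u{1F926}' => 2,               // FACE PALM: wide
        '\u{1F3FC}' => 2,               // skin tone modifier: wide
        '\u{2642}' => 1,                // MALE SIGN: counted as narrow
        '\u{200D}' | '\u{FE0F}' => 0,   // ZWJ and VS16: no width
        '\u{1F1E6}'..='\u{1F1FF}' => 1, // regional indicators: Neutral, counted as narrow
        _ => 1,                         // assumption: everything else is narrow
    }
}

/// Sum of per-scalar widths: what a cluster-unaware width function computes.
fn summed_width(s: &str) -> usize {
    s.chars().map(toy_width).sum()
}

/// Naive fix 1: width of the first scalar value only.
fn first_width(s: &str) -> usize {
    s.chars().next().map_or(0, toy_width)
}

/// Naive fix 2: width of the widest scalar value.
fn widest_width(s: &str) -> usize {
    s.chars().map(toy_width).max().unwrap_or(0)
}

fn main() {
    let facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
    let flag = "\u{1F1EB}\u{1F1EE}"; // two regional indicator symbols (F, I)
    println!("{}", summed_width(facepalm)); // 2 + 2 + 0 + 1 + 0
    println!("{} {}", first_width(facepalm), widest_width(facepalm));
    println!("{} {}", first_width(flag), widest_width(flag));
}
```

Each string here is treated as a single extended grapheme cluster. Both naive fixes give 2 for the face palm but only 1 for the flag, which is the problem described above.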

If you really must estimate display size without running text layout with a font, whether the extended grapheme cluster count or the East Asian Width of the string works better depends on context.

Arbitrary but Fair Quotas

In some cases there is a desire to impose a length limit that doesn’t arise from a strict storage limitation. For example, in the STUN protocol given earlier, presumably there is a desire to make it so that human-readable error messages cannot make protocol messages arbitrarily long. For example, in the case of Twitter, tweets being short is a core part of the type of expression that Twitter is about, so some definition of “short” is needed. In the case of string-based browser localStorage, there is a need to have some limit, but the limit is necessarily arbitrary and does not need to strictly map to bytes on disk.

In cases like this, there seems to be some concern that the limit should be internationally fair. Observations that UTF-8 and UTF-16 take a different amount of storage per character depending on the character superficially suggest that the UTF-8 length or the UTF-16 length might be unfair internationally.
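
To make the per-character storage difference concrete, here is a small sketch using only the Rust standard library. The helper name `lengths` and the sample characters are mine, chosen for illustration.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values).
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // Latin a, Latin e with acute, Cyrillic ya, a Han ideograph,
    // a Hangul syllable, and FACE PALM (supplementary plane).
    let samples = ["a", "\u{E9}", "\u{44F}", "\u{4E2D}", "\u{D55C}", "\u{1F926}"];
    for sample in samples.iter() {
        let (utf8, utf16, scalars) = lengths(sample);
        println!(
            "{}: UTF-8 {}, UTF-16 {}, scalar values {}",
            sample, utf8, utf16, scalars
        );
    }
}
```

Every sample is one scalar value, yet the UTF-8 length ranges from 1 to 4 bytes and the UTF-16 length from 1 to 2 code units.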

What’s fair, though? The usual concern goes that UTF-8 favors English, because English takes one byte per character, and disfavors CJK, because Chinese, Japanese, and Korean take three bytes per character, so UTF-8 is unfair to CJK. This kind of analysis ignores how much information is conveyed per character. To assess what lengths we get for different languages when the amount of information conveyed is kept constant, I looked at the counts for the translations of the Universal Declaration of Human Rights. This is a document for which translation of the same content is available in particularly many languages, which is why I used it as the measurement corpus.

Unfortunately, not all translations contain the same text, so one needs to be careful when preparing the data for comparison. Some translations are incomplete, in some cases, very incomplete. For this reason, I included only translations in stage 4 or stage 5 along the 5-stage scale. Some translations carry the preamble with the recitals, but some do not. Some also carry historical notes. To make the length comparable, the preamble, notes, and whitespace-only text nodes were omitted. The rest of the XML text nodes were concatenated and normalized to Unicode Normalization Form C before counting. (Source code is available.)

Let’s look at the result. The table at the end of this document is sortable and is initially sorted by UTF-8 length. Each Δ% column shows how much the count in the column to its left deviates from the median count for that measure. (A note about color-coding. Coloring longer than median as red should not be taken to imply that those languages are somehow bad. It’s meant to imply that a length quota treats those languages badly.) In the table, the name of each language links to the translation in that language hosted on the site of the Unicode Consortium. The linked HTML versions may include the preamble and/or notes.
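
As a rough sketch of what a Δ% column means, the deviation from the median can be computed like this. The function names and the toy counts are illustrative assumptions, not the source code mentioned above. In particular, how that code treats an even number of data points is not stated here, so averaging the two middle values is my assumption.

```rust
/// Median of a non-empty slice of counts. For an even number of values,
/// this sketch averages the two middle values (an assumption).
fn median(counts: &[u32]) -> f64 {
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
    } else {
        sorted[mid] as f64
    }
}

/// Signed deviation of `count` from `median`, in percent.
fn delta_percent(count: u32, median: f64) -> f64 {
    (count as f64 - median) / median * 100.0
}

fn main() {
    // Toy counts, not taken from the table.
    let counts = [50, 100, 150];
    let m = median(&counts);
    for &count in counts.iter() {
        println!("{} {:.1}", count, delta_percent(count, m));
    }
}
```

A negative value means the text is shorter than the median under that measure, so a quota based on it treats the language favorably.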

The CJK concern is alleviated when considering information conveyed. When measuring UTF-8 length, Mandarin using traditional characters is the shortest of the languages that have global name recognition! This should be expected, since the Han script pretty obviously packs more information per character than e.g. alphabetic scripts. (The globally less-known languages whose UTF-8 length is shorter than Mandarin’s (using traditional characters) are African and American Latin-script languages with a relatively small native speaker population for each—only one with a native speaker population exceeding a million and many whose native speaker population is smaller than 100 000, which explains why you might not recognize their names.)

Korean is also shorter than median in UTF-8 length. This also makes sense, since Hangul syllables pack three or two alphabetic jamo into one three-byte character. The UTF-8 length of Japanese is over median but only by 4.1%. The Japanese version of the text is 48% kanji and 52% hiragana. Japanese Wikipedia has almost the same kana to kanji ratio, though different kana: 46% kanji and the rest almost evenly split between hiragana and katakana, so we may assume the Universal Declaration of Human Rights to be representative of Japanese text in terms of kana to kanji ratio.
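
The jamo packing can be checked arithmetically. Precomposed Hangul syllables are laid out in Unicode so that the index within the block encodes the leading consonant, the vowel, and an optional trailing consonant. The sketch below uses only that layout and no Unicode data tables. The function name `jamo_count` is mine.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed Hangul syllable
const T_COUNT: u32 = 28; // trailing-consonant slots, including "none"
const S_COUNT: u32 = 19 * 21 * 28; // leading x vowel x trailing = 11172

/// Number of jamo (2 or 3) packed into a precomposed Hangul syllable,
/// or None if `c` is outside the Hangul Syllables block.
fn jamo_count(c: char) -> Option<u32> {
    let index = (c as u32).checked_sub(S_BASE)?;
    if index >= S_COUNT {
        return None;
    }
    // Trailing index 0 means the syllable has no trailing consonant.
    Some(if index % T_COUNT == 0 { 2 } else { 3 })
}

fn main() {
    for &c in ['\u{AC00}', '\u{D55C}'].iter() {
        println!("{} {:?} jamo, {} UTF-8 bytes", c, jamo_count(c), c.len_utf8());
    }
}
```

Both syllables take three bytes in UTF-8, whether they carry two jamo or three.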

When sorting by UTF-16 code unit count, UTF-32 / scalar value count, or extended grapheme cluster count, CJK are the shortest. While it’s true that UTF-8 takes more bytes for CJK than UTF-16, the notion of UTF-8 being particularly disfavorable to CJK is not true relative to other languages. Rather, UTF-16 is particularly favorable to CJK. In particular, the Han script is so information-dense that even when sorting by East Asian Width, which effectively doubles the length of CJK but not other languages, Han-script languages stay clustered at the start of the table. Korean and Japanese move further but remain below median.

The language with the longest UTF-8 length is Shan, which uses the Burmese script. The Burmese language, also using the Burmese script, is the second-longest in UTF-8 length. There are a number of other Brahmic-script languages among the ones with the longest UTF-8 length. They use three bytes per character but don’t have CJK-like per-character information density. These languages are below median in extended grapheme cluster count. In scalar value count, they intermingle with alphabetic languages.

It’s not clear if the concepts of median and mean (average) are meaningful. Does it make sense for a language with tens of millions of native speakers to count as a data point equal to a language with tens of thousands of native speakers? Since this is about writing, should the numbers of writers be considered instead? (I.e. should literacy rates be taken into account?) In the hope that with a large number of languages in the table, median hand-wavily sorts out this kind of issue, I chose to compare with median. At least the Han-script languages have numbers of native speakers comparable to those of the Brahmic-script languages and provide a counter-weight at the other end of the spectrum of UTF-8 length. In any case, for measures other than UTF-8 length, median and mean are very close to each other.

Saying that Brahmic-script languages intermingle with alphabetic languages in character count is rather meaningless, though. In character count, after CJK (and Han-script Vietnamese and Yi-script Nuosu), the language with the smallest character count is a Latin-script language (Waama). Also, the language with the largest character count is a Latin-script language (Ashéninka, Pichis). (I find it odd that in UTF-8 length Ashéninka Perené is the second-shortest but Ashéninka, Pichis is long enough to reach the Brahmic cluster. I don’t know what the relation of these two languages is and what explains two languages whose names suggest a close relation ending up at opposite extremes in length. Update: It has been pointed out to me that the supposed Ashéninka Perené translation is a mislabeled duplicate of the Cashinahua translation.)

One might hypothesize that the Latin script has just been put to so many uses that some of the uses have to be far from what it has been optimized for. Yet, when considering language-specific alphabets, the character counts for Greek and Georgian are above median. It just is the case that languages are different. In that sense, the whole notion of trying to find a simple length measure that is fair across languages seems folly.

Let’s look at the factor between the minimum and maximum of each measure, i.e. the factor by which the minimum needs to be multiplied to get the maximum. Let’s even ignore the outlier for maximum for each measure and use the second largest value instead of the largest value for each count. (Otherwise, Ashéninka, Pichis alone would skew the numbers a lot.) We get these factors:

UTF-8	8.6
UTF-16	7.9
UTF-32	7.9
EGC	7.9
EAW	4.3
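
For clarity, this is the computation as I read the description above, sketched with toy numbers rather than the real table data. The function name `spread_factor` is hypothetical.

```rust
/// Second-largest count divided by the smallest count.
/// Returns None when there are fewer than two counts or the minimum is zero.
fn spread_factor(counts: &[u32]) -> Option<f64> {
    if counts.len() < 2 {
        return None;
    }
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let min = sorted[0];
    // Skip the single largest value, which is treated as an outlier.
    let second_largest = sorted[sorted.len() - 2];
    if min == 0 {
        return None;
    }
    Some(second_largest as f64 / min as f64)
}

fn main() {
    // Toy counts, not taken from the table: the outlier 9000 is ignored.
    let counts = [100, 250, 400, 9000];
    println!("{:?}", spread_factor(&counts));
}
```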

UTF-16, UTF-32, and extended grapheme clusters aren’t distinguished by this measure, because the languages at the extremes use characters from the Basic Multilingual Plane with one character per grapheme cluster. Considering that there are supplementary-plane scripts, arguably the UTF-32 count would be fairer than the UTF-16 count even though this factor doesn’t show the difference. It’s not clear that counting extended grapheme clusters would be particularly fair compared to counting characters: It favors scripts that are visually joining over scripts that aren’t visually joining even if there’s no logical difference. While looking at just the factor, East Asian Width makes the gap the smallest, but it’s a rather imprecise fairness solution. It just counts CJK as double. Even after this, the Han-script languages are still among the ones with the smallest counts. On the other hand, it seems unfair to recognize Hangul syllables and kana as carrying more information than an alphabetic character while not giving the same treatment to other syllabaries, such as the Ethiopic script, Ge’ez.

Twitter counts each CJK character (including three-jamo Hangul syllables; i.e. it is not decomposing Hangul and treating it as alphabetic) as consuming 2 units of the quota (as when counting East Asian Width), counts emoji as consuming two units (even when East Asian Width of the cluster would be more), and, unlike East Asian Width, counts each Ethiopic syllable as consuming two units of the quota. What Twitter does seems fairer than just applying East Asian Width, but the result is still that the amount of information that can be packed in a tweet can vary four-fold depending on language. That still doesn’t seem exactly fair across languages.

In closing:

There is no simple measure of string length that would be fair in terms of how much information can be conveyed within a length quota regardless of language.
Of solutions that don’t depend on the Unicode database and, therefore, the Unicode version and that don’t ad-hoc hard-code character ranges according to a particular version of Unicode, counting characters (a.k.a. scalar values, i.e. UTF-32 length) is the best that can be done. It’s still wildly unfair, leading to almost eight-fold differences in how much information can be conveyed. This is not a flaw of Unicode but arises from differences in languages and writing systems.
While counting scalar values is fairer than just counting UTF-8 or UTF-16 code units, the factor between minimum and maximum UTF-8 length is so close to the factor between minimum and maximum UTF-32 length, both of which are pretty large, that instead of putting thought into using the scalar value length instead of the UTF-8 length or the UTF-16 length, it’s probably better to put the thought into reconsidering if you really need to impose such a limit.
Unicode doesn’t provide a good database-based definition that would improve upon the character count in terms of normalizing the amount of information conveyed. While East Asian Width brings minimum and maximum closer, it unfairly singles out Hangul syllables and kana without considering other syllabaries, because normalizing length for information conveyed is not the purpose of East Asian Width.
Even if per-script (possibly non-integer) weights assigned to characters could make things fairer, it wouldn’t work well for the Latin script, which is all over the place in terms of language-dependent length.
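
If you conclude that you do need a limit and settle on counting scalar values, a minimal sketch of enforcing it without any Unicode data might look like the following. The function name `truncate_to_scalars` is mine. Note that it can cut through an extended grapheme cluster, which is the price of not depending on the Unicode database.

```rust
/// Longest prefix of `s` that contains at most `max_scalars` Unicode
/// scalar values. Never splits a scalar value, but may split an
/// extended grapheme cluster.
fn truncate_to_scalars(s: &str, max_scalars: usize) -> &str {
    match s.char_indices().nth(max_scalars) {
        // Byte index where the first over-quota scalar value starts.
        Some((byte_index, _)) => &s[..byte_index],
        // Fewer scalar values than the quota: keep the whole string.
        None => s,
    }
}

fn main() {
    let facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
    // Keeps only the base emoji and drops the skin tone and gender parts.
    println!("{}", truncate_to_scalars(facepalm, 1));
    println!("{}", truncate_to_scalars("h\u{E9}llo", 2));
}
```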

Name	UTF-8	Δ%	UTF-16	Δ%	UTF-32	Δ%	EGC	Δ%	EAW	Δ%	Script
Cashinahua	4170	-57.6	4135	-53.0	4135	-52.9	4135	-52.3	4135	-52.4	Latn
Ashéninka Perené	4170	-57.6	4135	-53.0	4135	-52.9	4135	-52.3	4135	-52.4	Latn
Waama	4293	-56.3	4011	-54.4	4011	-54.4	4007	-53.8	4007	-53.9	Latn
Chickasaw	4850	-50.6	4685	-46.7	4685	-46.7	4587	-47.1	4587	-47.2	Latn
Bulu	4919	-49.9	4808	-45.3	4808	-45.3	4808	-44.5	4808	-44.7	Latn
Kulango, Bouna	5286	-46.2	4164	-52.6	4164	-52.6	4164	-51.9	4164	-52.1	Latn
Zapotec, Miahuatlán	5464	-44.4	5433	-38.2	5433	-38.2	5433	-37.3	5433	-37.5	Latn
Nyamwezi	5750	-41.5	5686	-35.3	5686	-35.3	5686	-34.4	5686	-34.6	Latn
Kaonde	5972	-39.2	5972	-32.1	5972	-32.0	5972	-31.1	5972	-31.3	Latn
Mixtec, Metlatónoc	6023	-38.7	5630	-36.0	5630	-35.9	5611	-35.2	5611	-35.4	Latn
Makonde	6100	-37.9	5946	-32.4	5946	-32.3	5946	-31.4	5946	-31.6	Latn
Sharanahua	6165	-37.3	6162	-29.9	6162	-29.9	6162	-28.9	6162	-29.1	Latn
Serer-Sine	6166	-37.3	6079	-30.9	6079	-30.8	6079	-29.8	6079	-30.0	Latn
Dinka, Northeastern	6214	-36.8	5815	-33.9	5815	-33.8	5775	-33.4	5775	-33.5	Latn
Okiek	6272	-36.2	6272	-28.7	6272	-28.6	6272	-27.6	6271	-27.8	Latn
Jola-Fonyi	6299	-35.9	6122	-30.4	6122	-30.3	6122	-29.3	6122	-29.5	Latn
Maninkakan, Eastern	6372	-35.2	5867	-33.3	5867	-33.2	5867	-32.3	5867	-32.5	Latn
Chinantec, Ojitlán	6463	-34.2	5957	-32.3	5957	-32.2	5957	-31.3	5957	-31.4	Latn
Soninke	6496	-33.9	6430	-26.9	6430	-26.8	6430	-25.8	6430	-26.0	Latn
Chokwe (Angola)	6596	-32.9	6565	-25.3	6565	-25.3	6565	-24.2	6565	-24.4	Latn
Chinese, Mandarin (Traditional)	6606	-32.8	2202	-75.0	2202	-74.9	2202	-74.6	4404	-49.3	Hant
Otomi, Mezquital	6614	-32.7	6438	-26.8	6438	-26.7	6379	-26.4	6379	-26.6	Latn
Chinese, Mandarin (Simplified)	6708	-31.7	2278	-74.1	2278	-74.1	2278	-73.7	4493	-48.3	Hans
Quechua (Unified Quichua, old Hispanic orthography)	6713	-31.7	6670	-24.2	6670	-24.1	6670	-23.0	6670	-23.2	Latn
Shilluk	6798	-30.8	6036	-31.4	6036	-31.3	6036	-30.3	6036	-30.5	Latn
Colorado	6798	-30.8	6797	-22.7	6797	-22.7	6796	-21.6	6794	-21.8	Latn
Dendi	6823	-30.6	6327	-28.1	6327	-28.0	6325	-27.0	6325	-27.2	Latn
Chinese, Jinyu	6848	-30.3	2284	-74.0	2284	-74.0	2284	-73.6	4566	-47.4	Hans
Chinese, Min Nan	6887	-29.9	2297	-73.9	2297	-73.9	2297	-73.5	4592	-47.1	Hans
Chinese, Gan	6889	-29.9	2297	-73.9	2297	-73.9	2297	-73.5	4593	-47.1	Hans
Vietnamese (Han nom)	6910	-29.7	2564	-70.8	2224	-74.7	2224	-74.3	4397	-49.4	Hani
Chinese, Hakka	6929	-29.5	2311	-73.7	2311	-73.7	2311	-73.3	4620	-46.8	Hans
Lunda	6968	-29.1	6968	-20.8	6968	-20.7	6968	-19.6	6968	-19.8	Latn
Chinese, Yue	6973	-29.0	2325	-73.6	2325	-73.5	2325	-73.2	4648	-46.5	Hani
Pular	6991	-28.9	6991	-20.5	6991	-20.4	6991	-19.3	6991	-19.5	Latn
Limba, West-Central	7007	-28.7	6257	-28.8	6257	-28.8	6257	-27.8	6257	-28.0	Latn
Naga, Ao	7019	-28.6	6729	-23.5	6729	-23.4	6729	-22.3	6729	-22.5	Latn
Mazahua Central	7052	-28.2	6750	-23.2	6750	-23.2	6517	-24.8	6517	-25.0	Latn
Chinese, Wu	7082	-27.9	2362	-73.1	2362	-73.1	2362	-72.7	4722	-45.6	Hans
Kpelle, Guinea	7139	-27.4	6136	-30.2	6136	-30.2	6136	-29.2	6136	-29.4	Latn
Amis	7206	-26.7	7206	-18.1	7206	-18.0	7206	-16.8	7206	-17.1	Latn
Baatonum	7255	-26.2	6788	-22.8	6788	-22.8	6779	-21.8	6779	-22.0	Latn
Tetun	7280	-25.9	7280	-17.2	7280	-17.2	7280	-16.0	7280	-16.2	Latn
Chinantec, Chiltepec	7304	-25.7	6468	-26.4	6468	-26.4	6262	-27.7	6262	-27.9	Latn
(Maiunan)	7312	-25.6	7312	-16.9	7312	-16.8	7312	-15.6	7312	-15.8	Latn
Tetun Dili	7357	-25.1	7225	-17.8	7225	-17.8	7225	-16.6	7225	-16.8	Latn
(Minjiang, written)	7366	-25.0	7363	-16.3	7363	-16.2	7363	-15.0	7363	-15.3	Latn
Quechua, Cusco	7369	-25.0	7309	-16.9	7309	-16.8	7309	-15.6	7309	-15.9	Latn
(Mijisa)	7393	-24.8	7393	-15.9	7393	-15.9	7393	-14.7	7392	-14.9	Latn
Drung	7412	-24.6	7412	-15.7	7412	-15.7	7412	-14.5	7412	-14.7	Latn
Mazatec, Ixcatlán	7442	-24.3	7261	-17.4	7261	-17.4	7261	-16.2	7261	-16.4	Latn
Rwanda	7456	-24.1	7456	-15.2	7456	-15.2	7456	-14.0	7456	-14.2	Latn
(Minjiang, spoken)	7512	-23.6	7509	-14.6	7509	-14.6	7509	-13.3	7509	-13.6	Latn
Sukuma	7532	-23.4	7452	-15.3	7452	-15.2	7452	-14.0	7452	-14.2	Latn
Makhuwa	7562	-23.0	7398	-15.9	7398	-15.8	7398	-14.6	7398	-14.8	Latn
Aymara, Central	7568	-23.0	7363	-16.3	7363	-16.2	7363	-15.0	7363	-15.3	Latn
Ido	7580	-22.9	7580	-13.8	7580	-13.7	7580	-12.5	7580	-12.8	Latn
Záparo	7591	-22.8	7583	-13.8	7583	-13.7	7583	-12.5	7583	-12.7	Latn
Bamanankan	7597	-22.7	6890	-21.7	6890	-21.6	6890	-20.5	6890	-20.7	Latn
Nyankore	7628	-22.4	7628	-13.3	7628	-13.2	7628	-12.0	7628	-12.2	Latn
Ndebele	7659	-22.1	7659	-12.9	7659	-12.8	7659	-11.6	7659	-11.8	Latn
Sãotomense	7712	-21.5	6956	-20.9	6956	-20.8	6956	-19.7	6956	-19.9	Latn
Pijin	7716	-21.5	7716	-12.3	7716	-12.2	7716	-11.0	7716	-11.2	Latn
Latin	7747	-21.2	7747	-11.9	7747	-11.8	7747	-10.6	7747	-10.8	Latn
Susu	7757	-21.1	7310	-16.9	7310	-16.8	7310	-15.6	7310	-15.9	Latn
Oroqen	7768	-21.0	7761	-11.7	7761	-11.7	7761	-10.4	7761	-10.7	Latn
Lozi	7825	-20.4	7825	-11.0	7825	-11.0	7825	-9.7	7825	-9.9	Latn
Latin (1)	7869	-19.9	7869	-10.5	7869	-10.5	7869	-9.2	7869	-9.4	Latn
Otuho	7890	-19.7	7890	-10.3	7890	-10.2	7801	-10.0	7712	-11.2	Latn
Achuar-Shiwiar (1)	7893	-19.7	7842	-10.8	7842	-10.8	7785	-10.2	7728	-11.0	Latn
Huastec (Veracruz)	7911	-19.5	7882	-10.4	7882	-10.3	7882	-9.0	7882	-9.3	Latn
Umbundu (011)	7941	-19.2	7910	-10.1	7910	-10.0	7910	-8.7	7910	-9.0	Latn
Nuosu	7953	-19.1	2663	-69.7	2663	-69.7	2663	-69.3	5308	-38.9	Yiii
Even	7969	-18.9	4320	-50.9	4320	-50.8	4320	-50.1	4320	-50.3	Cyrl
Q'eqchi'	7981	-18.8	7981	-9.2	7981	-9.2	7981	-7.9	7981	-8.1	Latn
Moba	7985	-18.7	7726	-12.1	7726	-12.1	7726	-10.8	7726	-11.1	Latn
Mam, Northern	7994	-18.7	7994	-9.1	7994	-9.0	7994	-7.7	7994	-8.0	Latn
Kabiyé	7997	-18.6	6193	-29.6	6193	-29.5	6193	-28.5	6193	-28.7	Latn
Kanuri, Central	8077	-17.8	7621	-13.3	7621	-13.3	7621	-12.0	7621	-12.3	Latn
Esperanto	8095	-17.6	7930	-9.8	7930	-9.8	7930	-8.5	7930	-8.7	Latn
Serbian (Latin)	8102	-17.6	7876	-10.4	7876	-10.4	7876	-9.1	7876	-9.3	Latn
Urarina	8127	-17.3	8125	-7.6	8125	-7.5	8125	-6.2	8125	-6.5	Latn
Kurdish, Central	8163	-16.9	7462	-15.1	7462	-15.1	7462	-13.9	7462	-14.1	Latn
Kurdish, Northern	8163	-16.9	7462	-15.1	7462	-15.1	7462	-13.9	7462	-14.1	Latn
Huitoto, Murui	8179	-16.8	7523	-14.5	7523	-14.4	7523	-13.2	7523	-13.4	Latn
Croatian	8201	-16.5	7996	-9.1	7996	-9.0	7996	-7.7	7996	-8.0	Latn
Bemba	8206	-16.5	8206	-6.7	8206	-6.6	8206	-5.3	8206	-5.5	Latn
Waorani	8209	-16.5	8137	-7.5	8137	-7.4	8052	-7.1	7967	-8.3	Latn
Gonja	8215	-16.4	7579	-13.8	7579	-13.8	7579	-12.5	7579	-12.8	Latn
Scots	8224	-16.3	8224	-6.5	8224	-6.4	8224	-5.1	8224	-5.3	Latn
Ndonga	8239	-16.2	8239	-6.3	8239	-6.2	8239	-4.9	8239	-5.2	Latn
Garifuna	8243	-16.1	7721	-12.2	7721	-12.1	7721	-10.9	7721	-11.1	Latn
Bosnian (Latin)	8259	-16.0	8049	-8.5	8049	-8.4	8049	-7.1	8049	-7.4	Latn
Twi (Akuapem)	8264	-15.9	7653	-13.0	7653	-12.9	7653	-11.7	7653	-11.9	Latn
Zulu	8265	-15.9	8261	-6.1	8261	-6.0	8261	-4.7	8261	-4.9	Latn
Guarayu	8280	-15.7	8098	-7.9	8098	-7.9	8098	-6.5	8098	-6.8	Latn
Swahili	8315	-15.4	8315	-5.4	8315	-5.4	8315	-4.0	8315	-4.3	Latn
Zhuang, Yongbei	8318	-15.4	8316	-5.4	8316	-5.4	8316	-4.0	8316	-4.3	Latn
Wolof	8321	-15.3	7940	-9.7	7940	-9.6	7940	-8.4	7940	-8.6	Latn
Zapotec, Güilá	8364	-14.9	8328	-5.3	8328	-5.2	8328	-3.9	8328	-4.1	Latn
Oromo, Borana-Arsi-Guji	8381	-14.7	8381	-4.7	8381	-4.6	8381	-3.3	8381	-3.5	Latn
Welsh	8382	-14.7	8247	-6.2	8247	-6.2	8247	-4.8	8247	-5.1	Latn
Tok Pisin	8399	-14.5	8393	-4.6	8393	-4.5	8393	-3.1	8393	-3.4	Latn
Awa-Cuaiquer	8405	-14.5	8391	-4.6	8391	-4.5	8309	-4.1	8227	-5.3	Latn
Luvale	8411	-14.4	8411	-4.4	8411	-4.3	8411	-2.9	8411	-3.2	Latn
Crioulo, Upper Guinea (008)	8414	-14.4	8225	-6.5	8225	-6.4	8225	-5.1	8225	-5.3	Latn
Afrikaans	8427	-14.2	8365	-4.9	8365	-4.8	8365	-3.5	8365	-3.7	Latn
Faroese	8454	-14.0	7854	-10.7	7854	-10.6	7854	-9.4	7854	-9.6	Latn
Fulfulde, Nigerian	8455	-14.0	8135	-7.5	8135	-7.4	8135	-6.1	8135	-6.4	Latn
Norwegian, Nynorsk	8461	-13.9	8268	-6.0	8268	-5.9	8268	-4.6	8268	-4.8	Latn
Yagua	8468	-13.8	8432	-4.1	8432	-4.1	8432	-2.7	8432	-2.9	Latn
Rundi	8498	-13.5	8498	-3.4	8498	-3.3	8498	-1.9	8498	-2.2	Latn
Norwegian, Bokmål	8500	-13.5	8360	-4.9	8360	-4.9	8360	-3.5	8360	-3.8	Latn
Umbundu	8503	-13.5	8415	-4.3	8415	-4.2	8415	-2.9	8415	-3.1	Latn
English	8565	-12.8	8555	-2.7	8555	-2.7	8555	-1.3	8555	-1.5	Latn
Yao	8574	-12.8	8574	-2.5	8574	-2.4	8574	-1.1	8574	-1.3	Latn
Nomatsiguenga	8575	-12.7	8432	-4.1	8432	-4.1	8432	-2.7	8432	-2.9	Latn
Mapudungun	8585	-12.6	8366	-4.9	8366	-4.8	8366	-3.5	8366	-3.7	Latn
Fijian	8586	-12.6	8584	-2.4	8584	-2.3	8584	-0.9	8584	-1.2	Latn
Tamazight, Central Atlas	8587	-12.6	8226	-6.5	8226	-6.4	8226	-5.1	8226	-5.3	Latn
Nyanja (Chinyanja)	8590	-12.6	8590	-2.3	8590	-2.3	8590	-0.9	8590	-1.1	Latn
Yapese	8635	-12.1	8473	-3.7	8473	-3.6	8473	-2.2	8473	-2.5	Latn
Crioulo, Upper Guinea	8636	-12.1	8632	-1.8	8632	-1.8	8632	-0.4	8632	-0.6	Latn
Secoya	8651	-12.0	8155	-7.3	8155	-7.2	8137	-6.1	8137	-6.3	Latn
Wayuu	8664	-11.8	8077	-8.2	8077	-8.1	8077	-6.8	8077	-7.0	Latn
Lingala	8668	-11.8	8654	-1.6	8654	-1.5	8654	-0.1	8654	-0.4	Latn
Haitian Creole French (Kreyol)	8680	-11.7	8535	-2.9	8535	-2.9	8535	-1.5	8535	-1.8	Latn
Tonga	8685	-11.6	8685	-1.2	8685	-1.2	8685	0.2	8685	-0.0	Latn
Seselwa Creole French	8706	-11.4	8697	-1.1	8697	-1.0	8697	0.4	8697	0.1	Latn
Mende	8707	-11.4	8010	-8.9	8010	-8.9	8010	-7.6	8010	-7.8	Latn
Nyanja (Chechewa)	8725	-11.2	8725	-0.8	8725	-0.7	8725	0.7	8725	0.4	Latn
Hani	8767	-10.8	8767	-0.3	8767	-0.2	8767	1.2	8767	0.9	Latn
Slovenian	8772	-10.7	8520	-3.1	8520	-3.0	8520	-1.7	8520	-1.9	Latn
Hmong, Southern Qiandong	8792	-10.5	8792	-0.0	8792	0.0	8792	1.5	8792	1.2	Latn
Chokwe	8808	-10.4	8808	0.2	8808	0.2	8808	1.7	8808	1.4	Latn
Pipil	8831	-10.1	8825	0.4	8825	0.4	8825	1.8	8825	1.6	Latn
(Bizisa)	8847	-10.0	8847	0.6	8847	0.7	8847	2.1	8847	1.8	Latn
Quechua, Cajamarca	8858	-9.9	8851	0.6	8851	0.7	8851	2.1	8851	1.9	Latn
Kasem	8868	-9.8	8445	-4.0	8445	-3.9	8445	-2.5	8445	-2.8	Latn
Romani, Balkan	8875	-9.7	8606	-2.1	8606	-2.1	8606	-0.7	8606	-0.9	Latn
Turkish	8877	-9.7	8225	-6.5	8225	-6.4	8225	-5.1	8225	-5.3	Latn
Fante	8898	-9.5	8229	-6.4	8229	-6.4	8229	-5.0	8229	-5.3	Latn
Basque	8907	-9.4	8907	1.3	8907	1.4	8907	2.8	8907	2.5	Latn
Ganda	8962	-8.8	8962	1.9	8962	2.0	8962	3.4	8962	3.2	Latn
Occitan	8963	-8.8	8661	-1.5	8661	-1.4	8661	-0.0	8661	-0.3	Latn
Xhosa	8969	-8.7	8881	1.0	8881	1.1	8881	2.5	8881	2.2	Latn
Breton	8982	-8.6	8661	-1.5	8661	-1.4	8661	-0.0	8661	-0.3	Latn
Veps	8985	-8.6	8428	-4.2	8428	-4.1	8428	-2.7	8428	-3.0	Latn
Quechua, Arequipa-La Unión	8988	-8.5	8969	2.0	8969	2.1	8969	3.5	8969	3.2	Latn
Friulian	9003	-8.4	8688	-1.2	8688	-1.1	8688	0.3	8688	0.0	Latn
Swedish	9008	-8.3	8612	-2.1	8612	-2.0	8612	-0.6	8612	-0.9	Latn
Danish	9010	-8.3	8831	0.4	8831	0.5	8831	1.9	8831	1.6	Latn
Aromanian	9020	-8.2	8694	-1.1	8694	-1.1	8694	0.3	8694	0.1	Latn
Madura	9023	-8.2	9023	2.6	9023	2.7	9023	4.1	9023	3.9	Latn
Romani, Balkan (1)	9035	-8.1	8739	-0.6	8739	-0.6	8739	0.9	8739	0.6	Latn
Chayahuita	9065	-7.8	8639	-1.8	8639	-1.7	8639	-0.3	8639	-0.6	Latn
Icelandic	9070	-7.7	8249	-6.2	8249	-6.1	8249	-4.8	8249	-5.1	Latn
Krio	9086	-7.5	8139	-7.4	8139	-7.4	8139	-6.1	8139	-6.3	Latn
Estonian	9093	-7.5	8800	0.1	8800	0.1	8800	1.6	8800	1.3	Latn
Aja	9099	-7.4	8077	-8.2	8077	-8.1	8069	-6.9	8069	-7.1	Latn
Sorbian, Upper	9108	-7.3	8442	-4.0	8442	-3.9	8442	-2.6	8442	-2.8	Latn
Sotho, Southern	9136	-7.0	9136	3.9	9136	4.0	9136	5.4	9136	5.2	Latn
Catalan-Valencian-Balear	9141	-7.0	8823	0.3	8823	0.4	8823	1.8	8823	1.6	Latn
Luba-Kasai	9143	-7.0	9143	4.0	9143	4.0	9143	5.5	9143	5.2	Latn
Minangkabau	9175	-6.6	9167	4.2	9167	4.3	9167	5.8	9167	5.5	Latn
Bari	9178	-6.6	8555	-2.7	8555	-2.7	8555	-1.3	8555	-1.5	Latn
Portuguese (Brazil)	9219	-6.2	8887	1.1	8887	1.1	8887	2.6	8887	2.3	Latn
Huastec (San Luís Potosí)	9222	-6.2	8826	0.4	8826	0.4	8826	1.9	8826	1.6	Latn
Czech	9225	-6.1	8126	-7.6	8126	-7.5	8126	-6.2	8126	-6.5	Latn
Purepecha	9234	-6.0	9082	3.3	9082	3.3	9082	4.8	9082	4.5	Latn
Fon	9244	-5.9	7952	-9.6	7952	-9.5	7943	-8.3	7943	-8.6	Latn
Twi (Asante)	9246	-5.9	8374	-4.8	8374	-4.7	8374	-3.4	8374	-3.6	Latn
Papiamentu	9249	-5.9	9237	5.0	9237	5.1	9237	6.6	9237	6.3	Latn
Slovak	9266	-5.7	8378	-4.7	8378	-4.7	8378	-3.3	8378	-3.6	Latn
Malagasy, Plateau	9272	-5.6	9272	5.4	9272	5.5	9272	7.0	9272	6.7	Latn
Romansch (Vallader)	9300	-5.4	9048	2.9	9048	3.0	9048	4.4	9048	4.1	Latn
Ladin	9324	-5.1	8740	-0.6	8740	-0.5	8740	0.9	8740	0.6	Latn
Mbundu	9327	-5.1	9317	5.9	9317	6.0	9317	7.5	9317	7.2	Latn
Occitan (Auvergnat)	9330	-5.1	8642	-1.7	8642	-1.7	8642	-0.3	8642	-0.5	Latn
Lithuanian	9339	-5.0	8794	0.0	8794	0.1	8794	1.5	8794	1.2	Latn
Ladino	9348	-4.9	9345	6.3	9345	6.3	9345	7.8	9345	7.6	Latn
Mískito	9353	-4.8	9345	6.3	9345	6.3	9345	7.8	9345	7.6	Latn
Assyrian Neo-Aramaic	9363	-4.7	5186	-41.0	5186	-41.0	5127	-40.8	5127	-41.0	Syrc
Waray-Waray	9387	-4.5	9387	6.7	9387	6.8	9387	8.3	9387	8.0	Latn
Korean	9391	-4.4	3856	-56.2	3856	-56.1	3856	-55.5	6623	-23.8	Hang
Somali	9403	-4.3	9403	6.9	9403	7.0	9403	8.5	9403	8.2	Latn
Finnish	9404	-4.3	9023	2.6	9023	2.7	9023	4.1	9023	3.9	Latn
Romansch (Sursilvan)	9421	-4.1	9300	5.8	9300	5.8	9300	7.3	9300	7.0	Latn
Chin, Tedim	9441	-3.9	9431	7.2	9431	7.3	9431	8.8	9431	8.6	Latn
Latvian	9447	-3.9	8582	-2.4	8582	-2.3	8582	-1.0	8582	-1.2	Latn
Romansch (Grischun)	9449	-3.8	9293	5.7	9293	5.7	9293	7.2	9293	7.0	Latn
Gagauz	9451	-3.8	8510	-3.2	8510	-3.2	8510	-1.8	8510	-2.0	Latn
Dagbani	9458	-3.8	8896	1.2	8896	1.2	8896	2.7	8896	2.4	Latn
Finnish, Kven	9464	-3.7	9123	3.7	9123	3.8	9123	5.3	9123	5.0	Latn
Corsican	9475	-3.6	8922	1.5	8922	1.5	8922	3.0	8922	2.7	Latn
Koongo (Angola)	9486	-3.5	9416	7.1	9416	7.1	9416	8.7	9356	7.7	Latn
Ditammari	9487	-3.5	7867	-10.5	7867	-10.5	7748	-10.6	7748	-10.8	Latn
Portuguese (Portugal)	9501	-3.3	9154	4.1	9154	4.2	9154	5.6	9154	5.4	Latn
Manx	9504	-3.3	9440	7.3	9440	7.4	9440	8.9	9440	8.7	Latn
Chamorro	9506	-3.3	9504	8.1	9504	8.1	9504	9.7	9504	9.4	Latn
Galician	9510	-3.2	9223	4.9	9223	4.9	9223	6.4	9223	6.2	Latn
Occitan (Languedocien)	9522	-3.1	9364	6.5	9364	6.6	9364	8.1	9364	7.8	Latn
Romansch (Puter)	9538	-2.9	9303	5.8	9303	5.9	9303	7.4	9303	7.1	Latn
Ligurian	9557	-2.7	8942	1.7	8942	1.8	8942	3.2	8942	2.9	Latn
Quechua, Huaylas Ancash	9563	-2.7	9471	7.7	9471	7.8	9471	9.3	9471	9.0	Latn
Mizo	9576	-2.6	9489	7.9	9489	8.0	9489	9.5	9489	9.2	Latn
Tiv	9585	-2.5	9490	7.9	9490	8.0	9490	9.5	9490	9.2	Latn
Interlingua	9588	-2.4	9588	9.0	9588	9.1	9588	10.7	9588	10.4	Latn
Koongo	9596	-2.4	9596	9.1	9596	9.2	9596	10.7	9596	10.5	Latn
Pohnpeian	9603	-2.3	9603	9.2	9603	9.3	9603	10.8	9603	10.5	Latn
Polish	9613	-2.2	9111	3.6	9111	3.7	9111	5.1	9111	4.9	Latn
Ga	9614	-2.2	8262	-6.0	8262	-6.0	8257	-4.7	8257	-5.0	Latn
Kituba	9630	-2.0	9630	9.5	9630	9.6	9630	11.1	9630	10.8	Latn
Palauan	9654	-1.8	9654	9.8	9654	9.9	9654	11.4	9654	11.1	Latn
Guaraní, Paraguayan	9658	-1.7	8956	1.8	8956	1.9	8956	3.4	8956	3.1	Latn
Frisian, Western	9660	-1.7	9495	8.0	9495	8.0	9495	9.6	9495	9.3	Latn
Albanian, Tosk	9703	-1.3	8972	2.0	8972	2.1	8972	3.5	8972	3.3	Latn
Italian	9739	-0.9	9674	10.0	9674	10.1	9674	11.6	9674	11.3	Latn
Marshallese	9758	-0.7	9758	11.0	9758	11.0	9758	12.6	9758	12.3	Latn
Spanish	9759	-0.7	9574	8.9	9574	8.9	9574	10.5	9574	10.2	Latn
Venetian	9764	-0.6	9083	3.3	9083	3.4	9083	4.8	9083	4.5	Latn
Romansch (Sutsilvan)	9764	-0.6	9459	7.6	9459	7.6	9459	9.2	9459	8.9	Latn
Huastec (Sierra de Otontepec)	9778	-0.5	9430	7.2	9430	7.3	9430	8.8	9430	8.5	Latn
Comorian, Ngazidja	9783	-0.4	9783	11.2	9783	11.3	9783	12.9	9783	12.6	Latn
Lamnso'	9792	-0.4	7828	-11.0	7828	-10.9	7648	-11.7	7648	-12.0	Latn
Hawaiian	9812	-0.2	8588	-2.3	8588	-2.3	8588	-0.9	8588	-1.2	Latn
Romansch (Surmiran)	9827	0.0	9662	9.9	9662	9.9	9662	11.5	9662	11.2	Latn
German, Standard (1996)	9828	0.0	9696	10.3	9696	10.3	9696	11.9	9696	11.6	Latn
Mixe, Totontepec	9829	0.0	8351	-5.0	8351	-5.0	8351	-3.6	8351	-3.9	Latn
German, Standard (1901)	9830	0.0	9692	10.2	9692	10.3	9692	11.9	9692	11.6	Latn
Talysh	9836	0.1	8180	-7.0	8180	-6.9	8180	-5.6	8180	-5.8	Latn
Aceh	9845	0.2	9729	10.6	9729	10.7	9729	12.3	9729	12.0	Latn
Maltese	9846	0.2	9198	4.6	9198	4.7	9198	6.2	9198	5.9	Latn
Chin, Matu	9854	0.3	9840	11.9	9840	12.0	9840	13.6	9840	13.3	Latn
Asturian	9858	0.3	9636	9.6	9636	9.6	9636	11.2	9636	10.9	Latn
Gaelic, Scottish	9859	0.3	9646	9.7	9646	9.8	9646	11.3	9646	11.0	Latn
Chuukese	9878	0.5	9878	12.3	9878	12.4	9878	14.0	9878	13.7	Latn
Nyemba	9882	0.6	9881	12.4	9881	12.4	9881	14.0	9881	13.7	Latn
Amarakaeri	9917	0.9	9499	8.0	9499	8.1	9086	4.9	9086	4.6	Latn
Candoshi-Shapra	9918	0.9	9862	12.1	9862	12.2	9862	13.8	9862	13.5	Latn
Siona	9933	1.1	9161	4.2	9161	4.2	8826	1.9	8748	0.7	Latn
Dangme	9936	1.1	8796	0.0	8796	0.1	8779	1.3	8779	1.0	Latn
Shona	9943	1.2	9943	13.1	9943	13.1	9943	14.7	9943	14.4	Latn
Páez	9980	1.6	9869	12.2	9869	12.3	9869	13.9	9869	13.6	Latn
Romansch	10003	1.8	9866	12.2	9866	12.3	9866	13.9	9866	13.6	Latn
Pampangan	10005	1.8	10005	13.8	10005	13.8	10005	15.5	10005	15.2	Latn
Cebuano	10008	1.8	10008	13.8	10008	13.9	10008	15.5	10008	15.2	Latn
Tagalog	10013	1.9	10013	13.9	10013	13.9	10013	15.6	10013	15.3	Latn
Romagnolo	10029	2.1	9511	8.2	9511	8.2	9511	9.8	9511	9.5	Latn
French	10030	2.1	9598	9.1	9598	9.2	9598	10.8	9598	10.5	Latn
Sotho, Northern	10036	2.1	9771	11.1	9771	11.2	9771	12.8	9771	12.5	Latn
Indonesian	10059	2.4	10059	14.4	10059	14.5	10059	16.1	10059	15.8	Latn
Tswana	10067	2.4	10047	14.2	10047	14.3	10047	15.9	10047	15.6	Latn
Bugis	10070	2.5	10070	14.5	10070	14.6	10070	16.2	10070	15.9	Latn
Sunda	10071	2.5	10071	14.5	10071	14.6	10071	16.2	10071	15.9	Latn
Uzbek, Northern (Latin)	10088	2.7	9836	11.8	9836	11.9	9836	13.5	9836	13.2	Latn
Gaelic, Irish	10114	2.9	9591	9.1	9591	9.1	9591	10.7	9591	10.4	Latn
Hindustani, Sarnami	10116	2.9	9963	13.3	9963	13.4	9963	15.0	9963	14.7	Latn
Tzeltal, Oxchuc	10119	3.0	9780	11.2	9780	11.3	9780	12.9	9780	12.6	Latn
Turkmen (Latin)	10124	3.0	9185	4.4	9185	4.5	9185	6.0	9185	5.7	Latn
Dagaare, Southern	10141	3.2	9495	8.0	9495	8.0	9477	9.4	9477	9.1	Latn
Igbo	10151	3.3	8653	-1.6	8653	-1.5	8653	-0.1	8653	-0.4	Latn
Picard	10151	3.3	9175	4.3	9175	4.4	9175	5.9	9175	5.6	Latn
Micmac	10162	3.4	9234	5.0	9234	5.1	9234	6.6	9234	6.3	Latn
Uyghur (Latin)	10186	3.7	9999	13.7	9999	13.8	9999	15.4	9999	15.1	Latn
Malay (Latin)	10189	3.7	10188	15.9	10188	15.9	10188	17.6	10188	17.3	Latn
Azerbaijani, North (Latin)	10198	3.8	8717	-0.9	8717	-0.8	8717	0.6	8717	0.3	Latn
Japanese	10227	4.1	3437	-60.9	3437	-60.9	3437	-60.3	6832	-21.4	Jpan
Bislama	10233	4.1	10233	16.4	10233	16.4	10233	18.1	10233	17.8	Latn
Bali	10235	4.2	10235	16.4	10235	16.5	10235	18.1	10235	17.8	Latn
Occitan (Francoprovençal, Savoie)	10240	4.2	8665	-1.5	8665	-1.4	8665	0.0	8665	-0.3	Latn
Themne	10244	4.2	8323	-5.4	8323	-5.3	8323	-3.9	8323	-4.2	Latn
Karelian	10245	4.3	9874	12.3	9874	12.4	9761	12.6	9648	11.0	Latn
Dutch	10247	4.3	10246	16.5	10246	16.6	10246	18.2	10246	17.9	Latn
Bamun	10248	4.3	8744	-0.6	8744	-0.5	8744	0.9	8744	0.6	Latn
Edo	10262	4.4	10260	16.7	10260	16.8	10260	18.4	10260	18.1	Latn
Bicolano, Central	10263	4.4	10263	16.7	10263	16.8	10263	18.4	10263	18.1	Latn
Tsonga (Mozambique)	10274	4.5	10047	14.2	10047	14.3	10047	15.9	10047	15.6	Latn
Quechua, Ayacucho	10295	4.8	10273	16.8	10273	16.9	10273	18.6	10273	18.2	Latn
Mina	10300	4.8	9085	3.3	9085	3.4	9040	4.3	9040	4.1	Latn
Romanian (2006)	10303	4.8	9683	10.1	9683	10.2	9683	11.7	9683	11.5	Latn
Luxembourgeois	10306	4.9	9998	13.7	9998	13.8	9998	15.4	9998	15.1	Latn
Romanian (1993)	10311	4.9	9691	10.2	9691	10.3	9691	11.8	9691	11.5	Latn
Romanian (1953)	10317	5.0	9691	10.2	9691	10.3	9691	11.8	9691	11.5	Latn
Mozarabic	10317	5.0	10184	15.8	10184	15.9	10184	17.5	10184	17.2	Latn
Sardinian, Logudorese	10323	5.0	10195	15.9	10195	16.0	10195	17.7	10195	17.3	Latn
Haitian Creole French (Popular)	10339	5.2	10103	14.9	10103	15.0	10103	16.6	10103	16.3	Latn
Hiligaynon	10405	5.9	10405	18.3	10405	18.4	10405	20.1	10405	19.8	Latn
Shor	10414	6.0	5724	-34.9	5724	-34.9	5724	-33.9	5724	-34.1	Cyrl
Sango	10428	6.1	8644	-1.7	8644	-1.6	8644	-0.2	8644	-0.5	Latn
Ilocano	10429	6.1	10429	18.6	10429	18.7	10429	20.4	10429	20.0	Latn
Occitan (Francoprovençal, Fribourg)	10439	6.2	9226	4.9	9226	5.0	9226	6.5	9226	6.2	Latn
Niue	10444	6.3	10443	18.8	10443	18.8	10443	20.5	10443	20.2	Latn
Comorian, Maore	10458	6.4	10340	17.6	10340	17.7	10340	19.3	10340	19.0	Latn
Chin, Falam	10467	6.5	10467	19.0	10467	19.1	10467	20.8	10467	20.5	Latn
Ibibio	10468	6.5	10467	19.0	10467	19.1	10467	20.8	10467	20.5	Latn
Lingala (tones)	10476	6.6	8990	2.2	8990	2.3	8760	1.1	8760	0.8	Latn
Hebrew	10502	6.9	5822	-33.8	5822	-33.8	5822	-32.8	5822	-33.0	Hebr
Saxon, Low	10539	7.2	10318	17.3	10318	17.4	10318	19.1	10318	18.8	Latn
Venda	10620	8.1	10106	14.9	10106	15.0	10106	16.6	10106	16.3	Latn
Mòoré	10621	8.1	9427	7.2	9427	7.3	9427	8.8	9427	8.5	Latn
Quichua, Chimborazo Highland	10651	8.4	10549	20.0	10549	20.0	10436	20.4	10323	18.8	Latn
Saami, North	10654	8.4	9944	13.1	9944	13.2	9944	14.8	9944	14.5	Latn
Occitan (Francoprovençal, Valais)	10662	8.5	9413	7.0	9413	7.1	9413	8.6	9413	8.3	Latn
Walloon	10714	9.0	9785	11.3	9785	11.3	9785	12.9	9785	12.6	Latn
Hungarian	10718	9.1	9783	11.2	9783	11.3	9783	12.9	9783	12.6	Latn
Nzema	10740	9.3	9439	7.3	9439	7.4	9439	8.9	9439	8.6	Latn
Tsonga (Zimbabwe)	10758	9.5	10546	19.9	10546	20.0	10546	21.7	10546	21.4	Latn
Quechua, North Junín	10765	9.5	10756	22.3	10756	22.4	10756	24.1	10756	23.8	Latn
Hmong, Northern Qiandong	10801	9.9	10801	22.8	10801	22.9	10801	24.7	10801	24.3	Latn
Khasi	10810	10.0	10605	20.6	10605	20.7	10605	22.4	10605	22.1	Latn
K'iche', Central	10817	10.1	10817	23.0	10817	23.1	10817	24.8	10817	24.5	Latn
Javanese (Latin)	10863	10.5	10863	23.5	10863	23.6	10863	25.4	10863	25.0	Latn
Occitan (Francoprovençal, Vaud)	10885	10.8	9757	11.0	9757	11.0	9757	12.6	9757	12.3	Latn
Shuar	10930	11.2	10533	19.8	10533	19.9	10533	21.6	10533	21.2	Latn
Baoulé	10946	11.4	10204	16.0	10204	16.1	10204	17.8	10204	17.4	Latn
Totonac, Papantla	10955	11.5	10955	24.6	10955	24.7	10955	26.4	10955	26.1	Latn
Evenki	10962	11.5	5948	-32.4	5948	-32.3	5776	-33.3	5776	-33.5	Cyrl
Kabuverdianu	10971	11.6	10334	17.5	10334	17.6	10334	19.3	10325	18.8	Latn
Jula	11038	12.3	8719	-0.9	8719	-0.8	8719	0.6	8719	0.4	Latn
Éwé	11107	13.0	9967	13.3	9967	13.4	9950	14.8	9950	14.5	Latn
Asháninka	11167	13.6	11164	27.0	11164	27.0	11164	28.8	11164	28.5	Latn
Hmong Njua	11179	13.8	11179	27.1	11179	27.2	11179	29.0	11179	28.7	Latn
Mbundu (009)	11200	14.0	11133	26.6	11133	26.7	11133	28.5	11133	28.1	Latn
Arabic, Standard	11214	14.1	6183	-29.7	6183	-29.6	6166	-28.8	6166	-29.0	Arab
Samoan	11231	14.3	11231	27.7	11231	27.8	11231	29.6	11231	29.3	Latn
Quechua, Margos-Yarowilca-Lauricocha	11260	14.6	11108	26.3	11108	26.4	11108	28.2	11108	27.9	Latn
Achuar-Shiwiar	11299	15.0	11296	28.5	11296	28.5	11296	30.4	11296	30.0	Latn
Tojolabal	11465	16.7	10173	15.7	10173	15.8	10173	17.4	10173	17.1	Latn
Bushi	11487	16.9	10980	24.9	10980	24.9	10980	26.7	10980	26.4	Latn
Osetin	11528	17.3	6370	-27.6	6370	-27.5	6370	-26.5	6370	-26.7	Cyrl
Tzotzil (Chamula)	11558	17.6	10703	21.7	10703	21.8	10703	23.5	10703	23.2	Latn
Rarotongan	11562	17.7	11527	31.1	11527	31.2	11527	33.0	11527	32.7	Latn
Maya, Yucatán	11732	19.4	10675	21.4	10675	21.5	10675	23.2	10675	22.9	Latn
Quechua, Northern Conchucos Ancash	11786	19.9	11782	34.0	11782	34.1	11782	36.0	11782	35.6	Latn
Yanomamö	11913	21.2	10470	19.1	10470	19.1	10470	20.8	10470	20.5	Latn
Aguaruna	11918	21.3	11854	34.8	11854	34.9	11854	36.8	11854	36.4	Latn
Hausa (Niger)	12078	22.9	11831	34.5	11831	34.6	11831	36.5	11831	36.2	Latn
Hausa (Nigeria)	12078	22.9	11863	34.9	11863	35.0	11863	36.9	11863	36.5	Latn
Vietnamese	12182	24.0	8877	0.9	8877	1.0	8877	2.4	8877	2.2	Latn
Chin, Haka	12231	24.5	12231	39.1	12231	39.2	12231	41.2	12231	40.8	Latn
Quechua, Ambo-Pasco	12327	25.4	12181	38.5	12181	38.6	12181	40.6	12181	40.2	Latn
Cashibo-Cacataibo	12349	25.7	11514	30.9	11514	31.0	11514	32.9	11514	32.5	Latn
Tem	12418	26.4	8878	1.0	8878	1.0	8246	-4.8	8246	-5.1	Latn
Ojibwa, Northwestern	12419	26.4	4775	-45.7	4775	-45.7	4775	-44.9	4775	-45.0	Cans
Pidgin, Nigerian	12424	26.4	12424	41.3	12424	41.4	12424	43.4	12424	43.0	Latn
Tahitian	12449	26.7	12244	39.2	12244	39.3	12244	41.3	12244	40.9	Latn
Amahuaca	12530	27.5	12530	42.5	12530	42.6	12530	44.6	12530	44.2	Latn
Lobi	12645	28.7	10435	18.7	10435	18.7	10435	20.4	10435	20.1	Latn
Cree, Swampy	12705	29.3	4849	-44.9	4849	-44.8	4849	-44.0	4849	-44.2	Cans
Navajo	12835	30.6	9981	13.5	9981	13.6	9803	13.1	9803	12.8	Latn
Quechua, South Bolivian	12924	31.5	12902	46.7	12902	46.8	12902	48.9	12902	48.5	Latn
Kaqchikel, Central	12943	31.7	12616	43.5	12616	43.6	12616	45.6	12616	45.2	Latn
Maori	12994	32.2	12993	47.7	12993	47.8	12993	49.9	12993	49.6	Latn
Seraiki	13020	32.5	7303	-17.0	7303	-16.9	7302	-15.7	7302	-16.0	Arab
Ticuna	13137	33.7	10508	19.5	10508	19.6	9886	14.1	9886	13.8	Latn
Arabela	13256	34.9	13255	50.7	13255	50.8	13255	53.0	13255	52.6	Latn
Swati	13372	36.1	13320	51.5	13320	51.6	13320	53.7	13320	53.3	Latn
Komi-Permyak	13499	37.4	7378	-16.1	7378	-16.0	7378	-14.9	7378	-15.1	Cyrl
Farsi, Western	13597	38.4	7537	-14.3	7537	-14.2	7460	-13.9	7460	-14.1	Arab
Yukaghir, Northern	13618	38.6	7366	-16.2	7366	-16.2	7366	-15.0	7366	-15.2	Cyrl
Dari	13669	39.1	7607	-13.5	7607	-13.4	7561	-12.7	7561	-13.0	Arab
Pintupi-Luritja	13736	39.8	13736	56.2	13736	56.3	13736	58.5	13736	58.1	Latn
Urdu	13859	41.0	7768	-11.7	7768	-11.6	7733	-10.8	7733	-11.0	Arab
Panjabi, Western	13996	42.4	7904	-10.1	7904	-10.1	7893	-8.9	7893	-9.2	Arab
Tongan	14017	42.6	12453	41.6	12453	41.7	12453	43.7	12453	43.3	Latn
Yoruba	14059	43.1	10238	16.4	10238	16.5	9276	7.1	9276	6.8	Latn
Inuktitut, Greenlandic	14067	43.1	14067	60.0	14067	60.1	14067	62.3	14067	61.9	Latn
Serbian (Cyrillic)	14090	43.4	7740	-12.0	7740	-11.9	7740	-10.7	7740	-10.9	Cyrl
Urdu (2)	14108	43.6	7904	-10.1	7904	-10.1	7868	-9.2	7868	-9.4	Arab
Nanai	14148	44.0	7666	-12.8	7666	-12.8	7636	-11.9	7636	-12.1	Cyrl
Caquinte	14250	45.0	14246	62.0	14246	62.1	14246	64.4	14246	64.0	Latn
Tigrigna	14270	45.2	5502	-37.4	5502	-37.4	5502	-36.5	5502	-36.7	Ethi
Bosnian (Cyrillic)	14404	46.6	7906	-10.1	7906	-10.0	7906	-8.8	7906	-9.0	Cyrl
Malay (Arabic)	14410	46.6	7899	-10.2	7899	-10.1	7899	-8.8	7899	-9.1	Arab
Konjo	14620	48.8	14620	66.2	14620	66.4	14620	68.7	14620	68.3	Latn
Pashto, Northern	14727	49.9	8276	-5.9	8276	-5.8	8274	-4.5	8274	-4.8	Arab
Bora	14934	52.0	11819	34.4	11819	34.5	11659	34.6	11659	34.2	Latn
Quechua, Huamalíes-Dos de Mayo Huánuco	14973	52.4	14772	68.0	14772	68.1	14772	70.5	14772	70.0	Latn
Toba	15250	55.2	14672	66.8	14672	67.0	14672	69.3	14672	68.9	Latn
Nahuatl, Central	15460	57.3	15457	75.8	15457	75.9	15457	78.4	15457	77.9	Latn
Vai	15555	58.3	6931	-21.2	6931	-21.1	6931	-20.0	6931	-20.2	Vaii
Tatar	15601	58.8	8493	-3.4	8493	-3.4	8493	-2.0	8493	-2.2	Cyrl
Tajiki	15606	58.8	8594	-2.3	8594	-2.2	8594	-0.8	8594	-1.1	Cyrl
Macedonian	15843	61.2	8704	-1.0	8704	-1.0	8704	0.5	8704	0.2	Cyrl
Ukrainian	16109	63.9	8785	-0.1	8785	-0.0	8785	1.4	8785	1.1	Cyrl
Azerbaijani, North (Cyrillic)	16117	64.0	8733	-0.7	8733	-0.6	8733	0.8	8733	0.5	Cyrl
Orok	16118	64.0	8696	-1.1	8696	-1.0	8251	-4.8	8251	-5.0	Cyrl
Amharic	16144	64.3	5382	-38.8	5382	-38.8	5382	-37.9	5382	-38.1	Ethi
Kazakh	16273	65.6	8791	-0.0	8791	0.0	8791	1.5	8791	1.2	Cyrl
Mongolian, Halh (Cyrillic)	16295	65.8	8837	0.5	8837	0.6	8837	2.0	8837	1.7	Cyrl
Tamazight, Standard Moroccan	16301	65.9	6371	-27.6	6371	-27.5	6371	-26.5	6371	-26.7	Tfng
Turkmen (Cyrillic)	16438	67.3	8826	0.4	8826	0.4	8826	1.9	8826	1.6	Cyrl
Altai, Southern	16508	68.0	8865	0.8	8865	0.9	8865	2.3	8865	2.0	Cyrl
Shipibo-Conibo	16674	69.7	16391	86.4	16391	86.5	16391	89.2	16391	88.7	Latn
Bulgarian	16844	71.4	9228	4.9	9228	5.0	9228	6.5	9228	6.2	Cyrl
Armenian	16853	71.5	9038	2.8	9038	2.8	9038	4.3	9038	4.0	Armn
Chachi	17042	73.4	16912	92.3	16912	92.4	16911	95.2	16910	94.6	Latn
Belarusan	17117	74.2	9307	5.8	9307	5.9	9307	7.4	9307	7.1	Cyrl
Tai Dam	17301	76.1	7181	-18.3	7181	-18.3	6423	-25.9	6423	-26.1	Tavt
Abkhaz	17318	76.2	9280	5.5	9280	5.6	9280	7.1	9280	6.8	Cyrl
Yaneshaʼ	17336	76.4	15851	80.2	15851	80.4	15238	75.9	15238	75.4	Latn
Uzbek, Northern (Cyrillic)	17394	77.0	9364	6.5	9364	6.6	9364	8.1	9364	7.8	Cyrl
Adyghe	17432	77.4	9483	7.8	9483	7.9	9483	9.4	9483	9.2	Cyrl
Kirghiz	17490	78.0	9390	6.8	9390	6.9	9390	8.4	9390	8.1	Cyrl
Nganasan	17527	78.4	9336	6.2	9336	6.2	9336	7.7	9336	7.5	Cyrl
Yiddish, Eastern	17589	79.0	9593	9.1	9593	9.2	8621	-0.5	8621	-0.8	Hebr
Yakut	17615	79.3	9470	7.7	9470	7.8	9470	9.3	9470	9.0	Cyrl
Khakas	17616	79.3	9554	8.6	9554	8.7	9554	10.3	9554	10.0	Cyrl
Tuva	17717	80.3	9572	8.8	9572	8.9	9572	10.5	9572	10.2	Cyrl
Russian	17750	80.6	9605	9.2	9605	9.3	9605	10.8	9605	10.6	Cyrl
Matsés	17788	81.0	17336	97.1	17336	97.3	17336	100.1	17336	99.5	Latn
Kabardian	17879	81.9	9633	9.5	9633	9.6	9633	11.2	9633	10.9	Cyrl
Inuktitut, Eastern Canadian	17910	82.3	6456	-26.6	6456	-26.5	6456	-25.5	6456	-25.7	Cans
Magahi	17920	82.4	6950	-21.0	6950	-20.9	5090	-41.3	6052	-30.3	Deva
Uyghur (Arabic)	18323	86.5	9826	11.7	9826	11.8	9826	13.4	9826	13.1	Arab
Greek (monotonic)	18324	86.5	10017	13.9	10017	14.0	10017	15.6	10017	15.3	Grek
Cherokee (cased)	18759	90.9	7245	-17.6	7245	-17.6	7245	-16.4	7245	-16.6	Cher
Cherokee (uppercase)	18759	90.9	7245	-17.6	7245	-17.6	7245	-16.4	7245	-16.6	Cher
Bhojpuri	18930	92.6	7294	-17.1	7294	-17.0	5217	-39.8	6274	-27.8	Deva
Greek (polytonic)	19555	99.0	10039	14.2	10039	14.2	10039	15.9	10039	15.6	Grek
Maithili	20047	104.0	7500	-14.7	7500	-14.7	5382	-37.9	6435	-25.9	Deva
Nepali	20816	111.8	7720	-12.2	7720	-12.2	5338	-38.4	6615	-23.9	Deva
Bengali	21349	117.2	7871	-10.5	7871	-10.4	5318	-38.6	7061	-18.7	Beng
Thai (2)	21694	120.8	7390	-16.0	7390	-15.9	5896	-32.0	5950	-31.5	Thai
Thai	21873	122.6	7479	-15.0	7479	-14.9	5992	-30.8	6043	-30.4	Thai
Gujarati	21890	122.8	8184	-6.9	8184	-6.9	5586	-35.5	6978	-19.7	Gujr
Ashéninka, Pichis	22298	126.9	22163	152.0	22163	152.2	22163	155.8	22163	155.1	Latn
Panjabi, Eastern	22584	129.8	8788	-0.1	8788	0.0	6181	-28.7	7470	-14.0	Guru
Sanskrit	22717	131.2	8171	-7.1	8171	-7.0	5186	-40.2	6544	-24.7	Deva
Sinhala	22785	131.9	8519	-3.1	8519	-3.1	6061	-30.1	6853	-21.1	Sinh
Khün	23411	138.2	8047	-8.5	8047	-8.4	4655	-46.3	5140	-40.8	Lana
Kannada	23429	138.4	8463	-3.8	8463	-3.7	5580	-35.6	6989	-19.6	Knda
Hindi	23466	138.8	8962	1.9	8962	2.0	6187	-28.6	7632	-12.2	Deva
Lao	24128	145.5	8340	-5.2	8340	-5.1	6365	-26.5	6447	-25.8	Laoo
Telugu	24993	154.3	9145	4.0	9145	4.1	6027	-30.4	7156	-17.6	Telu
Khmer, Central	25053	154.9	8619	-2.0	8619	-1.9	5511	-36.4	6791	-21.8	Khmr
Malayalam	25115	155.6	8907	1.3	8907	1.4	5286	-39.0	6762	-22.2	Mlym
Marathi	25231	156.8	9345	6.3	9345	6.3	6241	-28.0	7939	-8.6	Deva
Javanese (Javanese)	26155	166.2	8741	-0.6	8741	-0.5	5207	-39.9	6786	-21.9	Java
Georgian	26534	170.0	9742	10.8	9742	10.9	9742	12.4	9742	12.1	Geor
Chakma	27301	177.8	14231	61.8	7696	-12.4	4883	-43.6	5313	-38.8	Cakm
Pular (Adlam)	28460	189.6	14951	70.0	8233	-6.3	7435	-14.2	7435	-14.4	Adlm
Maldivian	28469	189.7	15030	70.9	15030	71.0	8449	-2.5	8449	-2.8	Thaa
Dzongkha	28504	190.1	9650	9.7	9650	9.8	7620	-12.1	7620	-12.3	Tibt
Mon	28674	191.8	10016	13.9	10016	14.0	5751	-33.6	6233	-28.3	Mymr
Sanskrit (Grantha)	29914	204.4	15418	75.3	8241	-6.2	5244	-39.5	8173	-5.9	Gran
Tamil	30208	207.4	10824	23.1	10824	23.2	6894	-20.4	9273	6.7	Taml
Tamil (Sri Lanka)	30213	207.4	10825	23.1	10825	23.2	6893	-20.5	9275	6.8	Taml
Tibetan, Central	30411	209.5	10243	16.5	10243	16.6	7958	-8.2	7958	-8.4	Tibt
Burmese	35846	264.8	12572	43.0	12572	43.1	7695	-11.2	8630	-0.7	Mymr
Shan	36130	267.7	12550	42.7	12550	42.8	8327	-3.9	8604	-1.0	Mymr
Min	4170	-57.6	2202	-75.0	2202	-74.9	2202	-74.6	4007	-53.9
Median	9827		8794		8788		8665		8688
Mean	11315	15.1	8833	0.4	8787	-0.0	8567	-1.1	8700	0.1
Max (ignoring outlier)	35846	264.8	17336	97.1	17336	97.3	17336	100.1	17336	99.5
Max	36130	267.7	22163	152.0	22163	152.2	22163	155.8	22163	155.1
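
The percentage next to each count appears to be that count's deviation from the median of its own column, rounded to one decimal place. This is inferred from the numbers themselves rather than stated in the rows above, so treat the following as a minimal sketch under that assumption. The helper name `delta_percent` is made up for illustration; the inputs are copied from the table.

```python
def delta_percent(count: int, median: int) -> float:
    """Percent deviation of `count` from the column median, to one decimal.

    Assumption: this is how the percentage columns were derived; the
    formula is inferred from the table data, not quoted from the post.
    """
    return round((count - median) / median * 100, 1)


# Medians of the first and second numeric columns (from the Median row).
FIRST_COLUMN_MEDIAN = 9827
SECOND_COLUMN_MEDIAN = 8794

# Spot checks against rows of the table.
print(delta_percent(9447, FIRST_COLUMN_MEDIAN))    # Latvian, first column
print(delta_percent(36130, FIRST_COLUMN_MEDIAN))   # Shan, first column
print(delta_percent(3437, SECOND_COLUMN_MEDIAN))   # Japanese, second column
```

Under that reading, a row such as Romansch (Surmiran), whose first count equals the column median, gets 0.0, and the sign of each percentage says whether a translation is shorter or longer than the median translation in that unit.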
