encoding_rs Performance, November 2016

I’m working on a new character encoding converter library for Gecko. The new library is written in Rust and is called encoding_rs. Here are some performance numbers as of the end of November 2016.

Reading the Tables

The columns are grouped into decode results and encode results. Each of those groups is further divided by the internal Unicode representation used: UTF-16 or UTF-8. (encoding_rs supports both in order to cater both to the legacy UTF-16 needs of Gecko and to Rust code, which normally uses UTF-8.) Within each group, there is a column for each library that encoding_rs is compared against. uconv is Gecko’s current encoding converter library. ICU is ICU 55. kernel32 is the kernel32.dll included in Windows 10. stdlib is Rust’s standard library. rust-encoding is the rust-encoding crate. glibc is glibc’s iconv.

Each row names a language and an external encoding to convert from or to. The numbers are factors relative to the library named in the column. 2.0 means that encoding_rs is twice as fast as the reference library. 0.5 means that the reference library is twice as fast as encoding_rs. 0.00 means that encoding_rs is so much slower that the factor has no non-zero digit within the first two decimal places.
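To make the reading of the factors concrete, here is a minimal sketch of how such a factor could be derived from two timings of the same input. The function names and the nanosecond inputs are assumptions made for illustration; the actual benchmark harness is not shown in this post.

```rust
/// Speed factor of encoding_rs relative to a reference library, computed
/// from the time each one took on the same input. Hypothetical helper,
/// not taken from the benchmark code.
fn speed_factor(reference_nanos: u64, encoding_rs_nanos: u64) -> f64 {
    reference_nanos as f64 / encoding_rs_nanos as f64
}

/// Formats a factor with two decimal places, as in the tables.
fn format_factor(factor: f64) -> String {
    format!("{:.2}", factor)
}

fn main() {
    // The reference library takes twice as long: encoding_rs is twice as fast.
    println!("{}", format_factor(speed_factor(2_000, 1_000)));
    // encoding_rs takes twice as long as the reference library.
    println!("{}", format_factor(speed_factor(1_000, 2_000)));
    // encoding_rs takes 500 times as long: the factor 0.002 displays as 0.00.
    println!("{}", format_factor(speed_factor(1_000, 500_000)));
}
```

A factor is a ratio of times, so it says nothing about absolute speed: a 0.00 entry only means the reference library was hundreds of times faster on that input.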

Workloads

When decoding from UTF-8, the test case is the Wikipedia article for Mars, the planet, for the language in question.

Reasons for choosing Wikipedia were:

The topic Mars, the planet, was chosen, because it is the most-featured topic across the different-language Wikipedias and, indeed, had non-trivial articles in all the languages needed. Trying to choose a typical-length article for each language separately wasn’t feasible in the Wikidata data set.

When decoding from a non-UTF-8 encoding, the test case is synthesized from the UTF-8 test case by converting the Wikipedia article to the encoding in question and replacing unmappable characters with numeric character references (and in the case of Big5 removing a couple of characters that glibc couldn’t deal with).
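The replacement of unmappable characters with numeric character references can be sketched as follows. This is an illustration only: it uses plain ISO-8859-1 (code points up to U+00FF map directly to one byte) as a stand-in target because that needs no lookup table, and the function name is hypothetical. It is not the tool that was used to generate the test cases.

```rust
/// Encodes `text` as ISO-8859-1 bytes. Any character above U+00FF has no
/// byte in this encoding, so it is written as a decimal numeric character
/// reference such as `&#8364;` instead.
fn encode_latin1_with_ncr(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len());
    for c in text.chars() {
        let code_point = c as u32;
        if code_point <= 0xFF {
            // Mappable: the code point value is the byte value.
            out.push(code_point as u8);
        } else {
            // Unmappable: emit an ASCII-only numeric character reference.
            out.extend_from_slice(format!("&#{};", code_point).as_bytes());
        }
    }
    out
}

fn main() {
    // U+00E9 fits in one byte; U+20AC (decimal 8364) does not.
    let bytes = encode_latin1_with_ncr("é€");
    println!("{:?}", bytes);
}
```

Since the references are pure ASCII, a test case with many unmappable characters is somewhat more ASCII-heavy than the original article.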

When testing x-user-defined decode, the test case is a JPEG image, because loading binary data over XHR is the main performance-sensitive use case for x-user-defined.
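For readers unfamiliar with x-user-defined, the mapping is simple: ASCII bytes decode to themselves and bytes 0x80 to 0xFF decode to U+F780 to U+F7FF, which is why the low 8 bits of each UTF-16 code unit give back the original byte. A minimal sketch of that mapping, with hypothetical function names and no relation to the actual encoding_rs implementation:

```rust
/// Decodes x-user-defined bytes to UTF-16 code units: ASCII stays as-is,
/// bytes 0x80..=0xFF map to U+F780..=U+F7FF.
fn decode_x_user_defined(bytes: &[u8]) -> Vec<u16> {
    bytes
        .iter()
        .map(|&b| {
            if b < 0x80 {
                b as u16
            } else {
                0xF780 + (b as u16 - 0x80)
            }
        })
        .collect()
}

/// Recovers the original bytes by keeping the low 8 bits of each unit.
fn recover_bytes(units: &[u16]) -> Vec<u8> {
    units.iter().map(|&u| (u & 0xFF) as u8).collect()
}

fn main() {
    // The first two bytes of a JPEG file are 0xFF 0xD8.
    let jpeg_start = [0xFFu8, 0xD8];
    let units = decode_x_user_defined(&jpeg_start);
    println!("{:04X?}", units);
    println!("{:02X?}", recover_bytes(&units));
}
```

Because binary data such as a JPEG contains few long ASCII runs, this workload mostly exercises the non-ASCII branch.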

Decoding JS or CSS wasn’t tested, but it’s safe to assume that the result would be faster than English, since JS and CSS tend to be 100% ASCII but English Wikipedia isn’t quite.

The encoder workloads use plain text extracts from the decoder test cases in order to simulate form submission (textarea) workloads.

The other Web-relevant case for the encoders is the parsing of URL query strings. In the absence of errors, the query strings are ASCII, so it’s safe to assume that the result would be faster than English as above.

Why These Reference Libraries?

ICU and glibc are tested in the form shipped by Ubuntu 16.04. While both Chrome and Safari use ICU, it is possible that the compiler and compiler options of ICU as shipped by Ubuntu cause the performance of ICU as tested here to differ from actual performance as used in competing browsers. Still, hopefully ICU as shipped by Ubuntu gives a ballpark understanding of performance relative to Chrome and Safari.

The entry point to kernel32.dll appears to be a non-streaming API. Logically, Edge and IE need streaming encoding converters and it’s not clear if the underlying converters used by Edge and IE are the same as the converters exposed by kernel32.dll, but hopefully the comparison with kernel32.dll gives a ballpark understanding of performance relative to Edge and IE.

Comparing with rust-encoding is relevant to understanding performance relative to Servo and the Rust ecosystem in general. glibc is included mainly as a matter of curiosity but could be considered relevant in terms of writing Linux apps in C using the Linux C ecosystem libraries vs. writing Linux apps in Rust using encoding_rs.

The Rust standard library isn’t interesting for now, since encoding_rs delegates UTF-8 validation to the Rust standard library at the moment. Any deviation from a factor of 1.00 is more likely to be jitter in the tests than anything substantive.
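To illustrate why the stdlib column is expected to sit near 1.00: if both sides of the comparison end up in the standard library’s UTF-8 validation, they measure the same code. The sketch below shows that validation call using `std::str::from_utf8`; the wrapper function is hypothetical and is not how encoding_rs structures its UTF-8 decoder.

```rust
/// Returns the length in bytes of the longest valid UTF-8 prefix of
/// `bytes`, delegating the validation itself to the standard library.
fn valid_utf8_prefix_len(bytes: &[u8]) -> usize {
    match std::str::from_utf8(bytes) {
        // The whole input is valid UTF-8.
        Ok(valid) => valid.len(),
        // Validation stopped at the first invalid sequence.
        Err(error) => error.valid_up_to(),
    }
}

fn main() {
    println!("{}", valid_utf8_prefix_len("Mars".as_bytes()));
    // 0xFF can never appear in UTF-8, so only the first byte is valid.
    println!("{}", valid_utf8_prefix_len(&[0x4D, 0xFF, 0x61]));
}
```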

Apples to Oranges Comparisons

Some of the comparisons could be considered to compare things that aren’t commensurable. In particular:

Notable Observations

The Numbers

Finally, here are the numbers.

x86_64 (Haswell) with explicit SSE2 in encoding_rs

Decode Encode
UTF-16 UTF-8 UTF-16 UTF-8
uconv ICU kernel32 stdlib rust-encoding glibc uconv ICU kernel32 rust-encoding glibc
Arabic, UTF-8 2.01 1.81 1.05 1.09 2.36 4.19 0.65 0.72 0.70 3657.56 116.78
Czech, UTF-8 2.13 2.13 1.35 1.12 3.24 5.68 0.64 0.69 0.63 9007.22 106.67
German, UTF-8 2.37 3.92 2.01 1.04 5.91 11.02 2.25 2.32 1.12 4260.44 79.39
Greek, UTF-8 2.04 2.02 1.12 1.04 2.79 4.95 0.65 0.71 0.71 4859.56 109.65
English, UTF-8 2.39 7.23 3.16 1.15 22.21 23.22 5.51 7.00 2.79 737.67 71.95
French, UTF-8 2.26 3.09 1.69 1.19 5.78 8.67 1.06 1.11 0.78 14715.00 81.61
Hebrew, UTF-8 2.02 1.73 1.09 1.06 2.22 3.95 0.69 0.72 0.71 8624.33 120.95
Portuguese, UTF-8 2.32 3.60 1.92 1.18 6.18 9.91 1.42 1.42 0.84 5460.33 81.75
Russian, UTF-8 2.06 1.89 1.09 1.04 2.51 4.42 0.69 0.71 0.73 18626.22 110.38
Thai, UTF-8 2.47 2.48 1.33 1.12 4.49 7.19 0.78 1.04 0.64 15117.22 72.96
Turkish, UTF-8 2.10 1.97 1.31 1.09 2.89 5.15 0.70 0.74 0.66 9548.89 109.67
Vietnamese, UTF-8 2.07 1.75 1.17 1.08 2.49 4.22 0.72 0.76 0.68 25492.11 147.91
Simplified Chinese, UTF-8 2.31 2.19 1.29 1.08 3.02 5.42 0.79 1.06 0.60 3422.56 75.57
Traditional Chinese, UTF-8 2.32 3.29 1.34 1.08 3.01 5.43 0.79 1.06 0.59 3408.67 76.55
Japanese, UTF-8 2.44 2.10 1.27 0.99 2.58 5.00 0.78 1.08 0.61 2684.89 75.00
Korean, UTF-8 2.03 1.64 1.06 0.86 1.79 4.09 0.93 1.16 0.62 3751.89 106.87
Arabic, windows-1256 1.57 1.20 0.81 5.37 3.93 1.43 0.14 0.02 0.33 0.44
Czech, windows-1250 2.22 1.70 1.12 4.74 6.18 2.60 0.61 0.12 0.82 1.27
German, windows-1252 5.42 4.14 2.74 8.81 16.01 13.19 2.69 0.67 3.32 4.75
Greek, windows-1253 2.04 1.56 1.03 6.33 4.95 1.16 0.22 0.03 0.48 0.50
English, windows-1252 9.18 7.02 4.66 18.23 43.34 54.12 10.94 2.86 18.34 25.10
French, windows-1252 3.64 2.78 1.86 9.17 10.49 7.71 1.62 0.38 2.03 2.97
Hebrew, windows-1255 1.85 1.12 0.75 5.05 4.98 1.63 0.30 0.04 0.58 0.58
Portuguese, windows-1252 4.49 3.43 2.29 7.67 13.62 10.03 2.10 0.52 2.59 3.67
Russian, windows-1251 1.60 1.22 0.81 4.15 4.04 0.98 0.27 0.04 0.55 0.54
Thai, windows-874 3.13 2.40 1.58 5.64 6.85 0.50 0.08 0.01 0.22 0.21
Turkish, windows-1254 2.04 1.56 1.05 6.43 5.60 3.40 0.60 0.12 0.81 1.26
Vietnamese, windows-1258 2.31 1.77 1.19 6.83 11.97 5.74 1.12 0.25 1.43 1.74
Simplified Chinese, gb18030 3.47 2.79 4.60 6.15 4.56 0.04 0.00 0.00 0.00 0.00
Traditional Chinese, Big5 2.87 2.17 1.59 4.43 4.12 1.32 0.01 0.00 0.01 0.02
Japanese, EUC-JP 4.03 3.30 6.07 5.53 0.72 0.01 0.02 0.10
Japanese, ISO-2022-JP 1.10 2.24 2.11 1.68 0.36 0.04 0.02 0.09
Japanese, Shift_JIS 1.85 2.10 1.47 3.73 4.06 0.58 0.01 0.01 0.03 0.03
Korean, EUC-KR 40.69 3.44 2.22 4.89 4.80 0.39 0.00 0.00 0.00 0.00
x-user-defined 1.16
Arabic, UTF-16LE 0.68 0.32 2.04 0.94
Czech, UTF-16LE 0.69 0.32 1.19 0.96
German, UTF-16LE 0.68 0.32 1.25 0.99
Greek, UTF-16LE 0.68 0.32 1.19 0.95
English, UTF-16LE 0.68 0.32 1.13 0.99
French, UTF-16LE 0.68 0.32 1.20 0.98
Hebrew, UTF-16LE 0.69 0.32 1.17 0.94
Portuguese, UTF-16LE 0.68 0.32 1.12 0.99
Russian, UTF-16LE 0.69 0.32 1.19 0.94
Thai, UTF-16LE 0.68 0.32 1.20 0.98
Turkish, UTF-16LE 0.68 0.32 1.10 0.96
Vietnamese, UTF-16LE 0.69 0.32 1.09 0.94
Simplified Chinese, UTF-16LE 0.69 0.32 1.16 0.96
Traditional Chinese, UTF-16LE 0.68 0.32 1.16 0.96
Japanese, UTF-16LE 0.69 0.32 1.22 0.96
Korean, UTF-16LE 0.69 0.32 1.21 0.95
Arabic, UTF-16BE 0.65 0.30 1.32 1.00
Czech, UTF-16BE 0.64 0.30 1.17 1.02
German, UTF-16BE 0.65 0.30 1.23 1.02
Greek, UTF-16BE 0.65 0.30 1.16 1.00
English, UTF-16BE 0.64 0.30 1.11 1.02
French, UTF-16BE 0.64 0.30 1.19 1.02
Hebrew, UTF-16BE 0.64 0.30 1.16 1.00
Portuguese, UTF-16BE 0.64 0.30 1.11 1.02
Russian, UTF-16BE 0.64 0.30 1.18 0.99
Thai, UTF-16BE 0.64 0.30 1.18 1.01
Turkish, UTF-16BE 0.65 0.30 1.10 1.02
Vietnamese, UTF-16BE 0.64 0.30 1.09 1.19
Simplified Chinese, UTF-16BE 0.64 0.30 1.14 1.00
Traditional Chinese, UTF-16BE 0.65 0.30 1.14 1.00
Japanese, UTF-16BE 0.65 0.30 1.15 1.00
Korean, UTF-16BE 0.65 0.30 1.19 0.99

x86_64 (Haswell) without explicit SSE2 in encoding_rs

Decode Encode
UTF-16 UTF-8 UTF-16 UTF-8
uconv ICU kernel32 stdlib rust-encoding glibc uconv ICU kernel32 rust-encoding glibc
Arabic, UTF-8 1.59 1.42 0.82 1.07 2.33 4.20 0.62 0.68 0.67 3657.56 117.08
Czech, UTF-8 1.48 1.49 0.94 1.12 3.23 5.69 0.72 0.77 0.70 9007.22 108.03
German, UTF-8 1.04 1.72 0.88 1.18 6.71 10.77 1.45 1.50 0.73 4260.44 80.71
Greek, UTF-8 1.50 1.49 0.83 1.03 2.76 4.93 0.61 0.67 0.67 4859.56 112.11
English, UTF-8 0.63 1.91 0.83 1.14 22.12 23.19 1.73 2.20 0.87 737.67 73.06
French, UTF-8 1.21 1.65 0.91 1.17 5.67 8.63 1.02 1.07 0.75 14715.00 81.77
Hebrew, UTF-8 1.62 1.39 0.87 1.06 2.21 3.93 0.65 0.68 0.67 8624.33 121.95
Portuguese, UTF-8 1.09 1.70 0.91 1.16 6.06 9.85 1.19 1.19 0.71 5460.33 81.64
Russian, UTF-8 1.58 1.45 0.84 1.04 2.50 4.38 0.65 0.66 0.67 18626.22 106.78
Thai, UTF-8 1.63 1.64 0.88 1.07 4.31 6.94 0.71 0.94 0.58 15117.22 70.17
Turkish, UTF-8 1.55 1.45 0.96 1.08 2.87 5.14 0.79 0.84 0.75 9548.89 109.61
Vietnamese, UTF-8 1.59 1.34 0.89 1.07 2.47 4.21 0.71 0.75 0.67 25492.11 150.07
Simplified Chinese, UTF-8 1.60 1.51 0.89 1.09 3.03 5.45 0.72 0.96 0.54 3422.56 77.90
Traditional Chinese, UTF-8 1.61 2.28 0.93 0.80 2.22 5.46 0.72 0.96 0.54 3408.67 76.91
Japanese, UTF-8 1.80 1.55 0.94 1.07 2.78 5.07 0.70 0.98 0.55 2684.89 75.96
Korean, UTF-8 1.54 1.24 0.81 1.07 2.23 4.08 0.85 1.05 0.56 3751.89 113.51
Arabic, windows-1256 1.11 0.85 0.57 4.13 3.02 1.42 0.14 0.02 0.34 0.45
Czech, windows-1250 1.45 1.11 0.73 3.45 4.07 2.76 0.65 0.12 0.83 1.29
German, windows-1252 1.95 1.49 0.99 4.48 5.36 10.24 2.09 0.52 2.47 3.53
Greek, windows-1253 1.28 0.98 0.65 4.34 3.40 1.16 0.22 0.03 0.49 0.52
English, windows-1252 2.14 1.63 1.09 5.81 6.86 19.19 3.88 1.01 4.34 5.83
French, windows-1252 1.78 1.36 0.91 4.91 4.90 7.49 1.58 0.37 1.94 2.84
Hebrew, windows-1255 1.35 0.82 0.55 3.93 3.88 1.63 0.30 0.04 0.61 0.61
Portuguese, windows-1252 1.88 1.44 0.96 4.10 5.20 8.86 1.85 0.46 2.25 3.18
Russian, windows-1251 1.13 0.86 0.57 3.31 3.04 0.98 0.27 0.04 0.57 0.57
Thai, windows-874 1.61 1.23 0.81 3.82 4.00 0.46 0.07 0.01 0.22 0.21
Turkish, windows-1254 1.41 1.07 0.72 4.46 3.88 3.58 0.63 0.13 0.83 1.28
Vietnamese, windows-1258 1.50 1.14 0.77 4.43 7.76 6.12 1.19 0.27 1.47 1.79
Simplified Chinese, gb18030 2.26 1.82 3.00 4.34 3.29 0.04 0.00 0.00 0.00 0.00
Traditional Chinese, Big5 2.13 1.62 1.18 3.48 3.22 1.31 0.01 0.00 0.01 0.02
Japanese, EUC-JP 2.73 2.24 4.38 3.94 0.72 0.01 0.02 0.10
Japanese, ISO-2022-JP 1.03 2.11 2.12 1.69 0.36 0.04 0.02 0.09
Japanese, Shift_JIS 1.47 1.66 1.17 3.67 3.21 0.58 0.01 0.01 0.03 0.03
Korean, EUC-KR 26.69 2.26 1.46 3.38 3.19 0.35 0.00 0.00 0.00 0.00
x-user-defined 1.16
Arabic, UTF-16LE 0.68 0.32 1.99 0.91
Czech, UTF-16LE 0.69 0.32 1.18 0.96
German, UTF-16LE 0.68 0.32 1.24 0.98
Greek, UTF-16LE 0.68 0.32 1.16 0.93
English, UTF-16LE 0.68 0.32 1.12 0.99
French, UTF-16LE 0.68 0.32 1.19 0.98
Hebrew, UTF-16LE 0.68 0.32 1.15 0.92
Portuguese, UTF-16LE 0.68 0.32 1.11 0.98
Russian, UTF-16LE 0.68 0.32 1.16 0.92
Thai, UTF-16LE 0.67 0.32 1.19 0.97
Turkish, UTF-16LE 0.68 0.32 1.10 0.95
Vietnamese, UTF-16LE 0.68 0.32 1.09 0.94
Simplified Chinese, UTF-16LE 0.68 0.32 1.15 0.95
Traditional Chinese, UTF-16LE 0.68 0.32 1.15 0.95
Japanese, UTF-16LE 0.54 0.25 1.21 0.95
Korean, UTF-16LE 0.68 0.32 1.20 0.94
Arabic, UTF-16BE 0.65 0.30 1.30 0.98
Czech, UTF-16BE 0.64 0.30 1.14 1.00
German, UTF-16BE 0.65 0.30 1.21 1.01
Greek, UTF-16BE 0.65 0.30 1.14 0.99
English, UTF-16BE 0.64 0.30 1.10 1.02
French, UTF-16BE 0.64 0.30 1.17 1.00
Hebrew, UTF-16BE 0.64 0.30 1.14 0.98
Portuguese, UTF-16BE 0.64 0.30 1.09 1.01
Russian, UTF-16BE 0.64 0.30 1.16 0.98
Thai, UTF-16BE 0.64 0.30 1.16 1.00
Turkish, UTF-16BE 0.65 0.30 1.08 1.00
Vietnamese, UTF-16BE 0.64 0.30 1.07 1.17
Simplified Chinese, UTF-16BE 0.64 0.30 1.13 1.00
Traditional Chinese, UTF-16BE 0.65 0.30 1.13 1.00
Japanese, UTF-16BE 0.65 0.30 1.14 0.99
Korean, UTF-16BE 0.65 0.30 1.18 0.99

ARMv7 code running on an ARMv8 CPU (Raspberry Pi 3) without explicit NEON in encoding_rs

Decode Encode
UTF-16 UTF-8 UTF-16 UTF-8
uconv ICU kernel32 stdlib rust-encoding glibc uconv ICU kernel32 rust-encoding glibc
Arabic, UTF-8 1.21 1.45 1.00 4.80 5.33 0.65 0.69 3147.96 119.19
Czech, UTF-8 0.76 0.89 1.00 7.84 7.79 1.02 0.85 5926.37 104.63
German, UTF-8 0.64 0.86 1.00 13.45 11.42 1.76 2.73 11320.06 96.01
Greek, UTF-8 1.09 1.34 1.00 5.76 6.10 0.64 1.38 8810.62 120.38
English, UTF-8 0.58 0.86 0.99 20.58 13.35 1.96 3.06 4057.53 90.11
French, UTF-8 0.65 0.85 1.00 12.20 9.37 1.50 1.17 12590.74 47.34
Hebrew, UTF-8 1.16 1.38 1.00 4.73 5.21 0.66 0.70 6779.61 58.82
Portuguese, UTF-8 0.63 0.87 1.00 13.57 10.64 1.60 1.25 5584.95 97.60
Russian, UTF-8 1.18 1.46 1.00 5.37 5.42 0.64 0.69 15840.13 116.02
Thai, UTF-8 1.18 1.26 1.00 6.78 3.29 0.66 1.06 22381.14 98.15
Turkish, UTF-8 0.78 0.90 1.01 7.23 7.03 1.07 0.89 6750.38 104.90
Vietnamese, UTF-8 0.91 0.98 1.00 5.76 5.50 0.82 0.78 13237.81 112.05
Simplified Chinese, UTF-8 1.04 1.11 1.00 5.90 12.15 0.67 1.06 4373.86 104.74
Traditional Chinese, UTF-8 2.03 1.09 1.00 5.91 6.11 0.67 1.06 4373.17 102.60
Japanese, UTF-8 2.37 1.20 1.00 5.05 5.34 0.66 1.06 4519.20 102.65
Korean, UTF-8 1.13 1.05 1.00 4.80 5.33 0.70 1.02 2906.59 109.17
Arabic, windows-1256 1.03 0.79 5.16 3.10 2.27 0.19 0.56 0.66
Czech, windows-1250 1.51 1.17 3.87 4.11 4.88 0.66 1.30 1.85
German, windows-1252 1.73 1.34 4.04 4.75 13.41 3.36 5.71 4.58
Greek, windows-1253 1.12 0.86 5.34 3.34 2.20 0.58 1.71 0.76
English, windows-1252 1.68 1.30 4.61 4.95 17.93 4.41 3.63 5.84
French, windows-1252 1.65 1.25 4.31 4.56 11.21 1.43 2.52 1.96
Hebrew, windows-1255 1.04 0.80 5.16 3.68 3.35 0.37 1.04 0.81
Portuguese, windows-1252 1.65 1.28 4.51 4.65 13.03 1.65 2.84 4.47
Russian, windows-1251 1.05 0.81 4.94 3.11 1.84 0.34 0.97 0.85
Thai, windows-874 1.26 0.97 4.69 3.82 0.90 0.11 0.37 0.34
Turkish, windows-1254 1.47 1.13 4.62 4.09 5.49 0.63 1.25 1.79
Vietnamese, windows-1258 1.52 1.18 5.10 5.55 10.21 1.22 2.13 2.18
Simplified Chinese, gb18030 1.99 1.71 4.40 2.38 0.05 0.00 0.00 0.00
Traditional Chinese, Big5 2.64 1.56 4.05 2.71 0.85 0.00 0.01 0.02
Japanese, EUC-JP 2.29 1.75 4.08 2.57 0.41 0.01 0.02 0.10
Japanese, ISO-2022-JP 0.68 1.54 2.01 1.14 0.23 0.03 0.02 0.14
Japanese, Shift_JIS 1.57 1.48 4.25 2.43 0.24 0.00 0.02 0.01
Korean, EUC-KR 26.55 1.85 3.90 2.48 0.25 0.00 0.00 0.00
x-user-defined 1.43
Arabic, UTF-16LE 0.47 0.30 1.34 0.63
Czech, UTF-16LE 0.47 0.30 1.19 0.58
German, UTF-16LE 0.47 0.30 0.95 0.58
Greek, UTF-16LE 0.93 0.30 1.24 0.62
English, UTF-16LE 0.47 0.31 1.00 0.57
French, UTF-16LE 0.47 0.30 1.00 0.58
Hebrew, UTF-16LE 0.47 0.30 1.30 0.63
Portuguese, UTF-16LE 0.47 0.31 0.98 0.58
Russian, UTF-16LE 0.47 0.31 1.37 0.63
Thai, UTF-16LE 0.47 0.31 1.23 0.63
Turkish, UTF-16LE 0.47 0.30 0.99 0.59
Vietnamese, UTF-16LE 0.47 0.31 1.06 0.59
Simplified Chinese, UTF-16LE 0.47 0.30 1.11 0.61
Traditional Chinese, UTF-16LE 0.24 0.15 1.11 0.61
Japanese, UTF-16LE 0.47 0.30 1.19 0.62
Korean, UTF-16LE 0.70 0.30 1.18 0.62
Arabic, UTF-16BE 0.53 0.27 1.28 0.79
Czech, UTF-16BE 0.53 0.28 0.92 0.74
German, UTF-16BE 0.53 0.27 0.87 0.73
Greek, UTF-16BE 0.27 0.14 1.18 0.77
English, UTF-16BE 0.53 0.28 0.93 0.73
French, UTF-16BE 0.53 0.28 0.93 0.73
Hebrew, UTF-16BE 0.53 0.27 2.49 0.39
Portuguese, UTF-16BE 0.53 0.28 0.91 0.73
Russian, UTF-16BE 0.53 0.28 1.31 0.78
Thai, UTF-16BE 0.53 0.28 1.16 0.78
Turkish, UTF-16BE 0.53 0.27 0.92 0.74
Vietnamese, UTF-16BE 0.53 0.28 0.99 0.75
Simplified Chinese, UTF-16BE 0.53 0.28 1.04 0.76
Traditional Chinese, UTF-16BE 0.53 0.27 1.04 0.76
Japanese, UTF-16BE 0.53 0.27 1.12 0.77
Korean, UTF-16BE 0.52 0.26 1.11 0.76