Broken UTF-8
Any copyright to this file is dedicated to the Public Domain. https://creativecommons.org/publicdomain/zero/1.0/
Five-byte and six-byte sequences were defined in RFC 2279 but are no longer part of the UTF-8 definition; RFC 3629 limits UTF-8 to sequences of at most four bytes.
Non-shortest forms for lowest single-byte (U+0000)
- Two-byte sequence (C0 80)
- À
- Three-byte sequence (E0 80 80)
- à
- Four-byte sequence (F0 80 80 80)
- đ
- Five-byte sequence (F8 80 80 80 80)
- ű
- Six-byte sequence (FC 80 80 80 80 80)
- ü
Non-shortest forms for highest single-byte (U+007F)
- Two-byte sequence (C1 BF)
- Áż
- Three-byte sequence (E0 81 BF)
- àż
- Four-byte sequence (F0 80 81 BF)
- đż
- Five-byte sequence (F8 80 80 81 BF)
- űż
- Six-byte sequence (FC 80 80 80 81 BF)
- üż
Non-shortest forms for lowest two-byte (U+0080)
- Three-byte sequence (E0 82 80)
- à
- Four-byte sequence (F0 80 82 80)
- đ
- Five-byte sequence (F8 80 80 82 80)
- ű
- Six-byte sequence (FC 80 80 80 82 80)
- ü
Non-shortest forms for highest two-byte (U+07FF)
- Three-byte sequence (E0 9F BF)
- àż
- Four-byte sequence (F0 80 9F BF)
- đż
- Five-byte sequence (F8 80 80 9F BF)
- űż
- Six-byte sequence (FC 80 80 80 9F BF)
- üż
Non-shortest forms for lowest three-byte (U+0800)
- Four-byte sequence (F0 80 A0 80)
- đ
- Five-byte sequence (F8 80 80 A0 80)
- ű
- Six-byte sequence (FC 80 80 80 A0 80)
- ü
Non-shortest forms for highest three-byte (U+FFFF)
- Four-byte sequence (F0 8F BF BF)
- đżż
- Five-byte sequence (F8 80 8F BF BF)
- űżż
- Six-byte sequence (FC 80 80 8F BF BF)
- üżż
Non-shortest forms for lowest four-byte (U+10000)
- Five-byte sequence (F8 80 90 80 80)
- ű
- Six-byte sequence (FC 80 80 90 80 80)
- ü
Non-shortest forms for last Unicode (U+10FFFF)
- Five-byte sequence (F8 84 8F BF BF)
- űżż
- Six-byte sequence (FC 80 84 8F BF BF)
- üżż
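The non-shortest forms listed so far all use the RFC 2279 bit layout to spell a code point that already has a shorter encoding. The following is a rough Python sketch, not part of any test harness: the helper names `naive_code_point` and `is_strict_utf8` are made up for illustration. It shows that a decoder which merely strips the length marker and the trail-byte prefixes recovers the named code point, while Python's built-in strict decoder rejects the same bytes.

```python
def naive_code_point(seq: bytes) -> int:
    """Extract payload bits from one RFC 2279-style sequence, with no validation.

    Hypothetical helper for illustration only; a real decoder must not do this.
    """
    lead = seq[0]
    # The number of leading 1 bits in the lead byte gives the sequence length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length == 0:
        return lead  # plain ASCII byte
    code_point = lead & (mask - 1)  # payload bits of the lead byte
    for trail in seq[1:length]:
        code_point = (code_point << 6) | (trail & 0x3F)  # six payload bits each
    return code_point


def is_strict_utf8(seq: bytes) -> bool:
    """Return True if Python's built-in strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for hex_bytes in ("C0 80", "C1 BF", "E0 9F BF", "F0 8F BF BF", "F8 84 8F BF BF"):
    raw = bytes.fromhex(hex_bytes)
    print(f"{hex_bytes}: naive U+{naive_code_point(raw):04X}, "
          f"strict decoder accepts: {is_strict_utf8(raw)}")
```

Accepting such forms is a classic way for forbidden characters to slip past byte-level filters, which is why a conforming decoder must treat them as errors.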
Out of range
- One past Unicode (F4 90 80 80)
- ô
- Longest five-byte sequence (FB BF BF BF BF)
- ûżżżż
- Longest six-byte sequence (FD BF BF BF BF BF)
- ężżżżż
- First surrogate (ED A0 80)
- í
- Last surrogate (ED BF BF)
- íżż
- CESU-8 surrogate pair (ED A0 BD ED B2 A9)
- í œíČ©
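The out-of-range cases above are well-formed in their bit layout but name values that UTF-8 may not carry: code points above U+10FFFF and the surrogate range U+D800 to U+DFFF. As a hedged illustration in Python (the helper `rejection_reason` is a hypothetical name), the strict decoder rejects all of them, whereas the standard `surrogatepass` error handler deliberately lets surrogates through, which is roughly what a CESU-8-tolerant reader would do.

```python
def rejection_reason(seq: bytes) -> str:
    """Return the strict decoder's error reason, or "" if seq is accepted."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return ""


one_past_unicode = bytes.fromhex("F4 90 80 80")
first_surrogate = bytes.fromhex("ED A0 80")
cesu8_pair = bytes.fromhex("ED A0 BD ED B2 A9")

for name, raw in [("one past Unicode", one_past_unicode),
                  ("first surrogate", first_surrogate),
                  ("CESU-8 pair", cesu8_pair)]:
    verdict = "rejected" if rejection_reason(raw) else "accepted"
    print(f"{name}: {verdict}")

# With "surrogatepass", surrogates decode to lone surrogate code points.
lone = first_surrogate.decode("utf-8", "surrogatepass")
pair = cesu8_pair.decode("utf-8", "surrogatepass")

# Round-tripping through UTF-16 joins the two halves into one code point.
combined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(f"lone: U+{ord(lone):04X}, pair length: {len(pair)}, "
      f"combined: U+{ord(combined):04X}")
```

Strings containing lone surrogates cannot be encoded back to strict UTF-8, so the sketch prints code point numbers rather than the characters themselves.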
Out of range and non-shortest
- One past Unicode as five-byte sequence (F8 84 90 80 80)
- ű
- One past Unicode as six-byte sequence (FC 80 84 90 80 80)
- ü
- First surrogate as four-byte sequence (F0 8D A0 80)
- đ
- Last surrogate as four-byte sequence (F0 8D BF BF)
- đżż
- CESU-8 surrogate pair as two four-byte overlongs (F0 8D A0 BD F0 8D B2 A9)
- đ œđČ©
Lone trails
- One (80)
-
- Two (80 80)
-
- Three (80 80 80)
-
- Four (80 80 80 80)
-
- Five (80 80 80 80 80)
-
- Six (80 80 80 80 80 80)
-
- Seven (80 80 80 80 80 80 80)
-
- After valid two-byte (C2 B6 80)
- ¶
- After valid three-byte (E2 98 83 80)
- â
- After valid four-byte (F0 9F 92 A9 80)
- đ©
- After five-byte (FB BF BF BF BF 80)
- ûżżżż
- After six-byte (FD BF BF BF BF BF 80)
- ężżżżż
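A trail byte (80 to BF) with no lead byte to attach to cannot be decoded. One common recovery policy is to substitute U+FFFD REPLACEMENT CHARACTER and carry on. The sketch below uses Python's standard `errors="replace"` handler to show this for a few of the cases above; `decode_lossy` and `code_points` are illustrative names only. How many replacement characters a decoder emits for a malformed run is a policy choice that can differ between implementations, so the output is best read as one data point, not as the required result.

```python
def decode_lossy(seq: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that cannot be decoded."""
    return seq.decode("utf-8", errors="replace")


def code_points(text: str) -> str:
    """Format a string as space-separated U+XXXX labels (ASCII-safe output)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for hex_bytes in ("80", "80 80 80", "C2 B6 80", "E2 98 83 80", "F0 9F 92 A9 80"):
    decoded = decode_lossy(bytes.fromhex(hex_bytes))
    print(f"{hex_bytes}: {code_points(decoded)}")
```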
Truncated sequences
- Two-byte lead (C2)
- Â
- Three-byte lead (E2)
- â
- Three-byte lead and one trail (E2 98)
- â
- Four-byte lead (F0)
- đ
- Four-byte lead and one trail (F0 9F)
- đ
- Four-byte lead and two trails (F0 9F 92)
- đ
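A truncated sequence is only an error once it is known that no more bytes are coming; in the middle of a stream the same bytes may simply be the first half of a valid character. A minimal sketch of that distinction, assuming Python's standard `codecs` incremental decoder (the wrapper `decode_chunks` is a hypothetical name):

```python
import codecs


def decode_chunks(chunks: list) -> str:
    """Strictly decode a list of byte chunks as one UTF-8 stream."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        # An incomplete tail is buffered unless this is the final chunk.
        pieces.append(decoder.decode(chunk, final=is_last))
    return "".join(pieces)


# F0 9F is a four-byte lead and one trail; the rest arrives in the next chunk.
text = decode_chunks([bytes.fromhex("F0 9F"), bytes.fromhex("92 A9")])
print(f"split sequence decodes to U+{ord(text):04X}")

# The same prefix at the very end of the input is a truncated sequence.
try:
    decode_chunks([bytes.fromhex("F0 9F 92")])
except UnicodeDecodeError as exc:
    print(f"truncated at end of input: {exc.reason}")
```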
Leftovers
- FE (FE)
- ț
- FE and trail (FE 80)
- ț
- FF (FF)
- ÿ
- FF and trail (FF 80)
- ÿ
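FE and FF are the two byte values that were never lead or trail bytes, even under RFC 2279. To put them in context with the other sections, here is an illustrative Python sketch that sorts every byte value into its role under the current four-byte definition. The function name `classify_byte` and the label strings are assumptions made for this example, not terminology from any specification.

```python
def classify_byte(value: int) -> str:
    """Describe the role a single byte value can play in UTF-8 (RFC 3629)."""
    if value <= 0x7F:
        return "ascii"
    if value <= 0xBF:
        return "trail"
    if value <= 0xC1:
        return "invalid: lead of an overlong two-byte form"
    if value <= 0xDF:
        return "two-byte lead"
    if value <= 0xEF:
        return "three-byte lead"
    if value <= 0xF4:
        return "four-byte lead"
    if value <= 0xF7:
        return "invalid: four-byte lead above U+10FFFF"
    if value <= 0xFB:
        return "invalid: five-byte lead (RFC 2279 only)"
    if value <= 0xFD:
        return "invalid: six-byte lead (RFC 2279 only)"
    return "invalid: never used in UTF-8"


for value in (0x41, 0x80, 0xC0, 0xC2, 0xE2, 0xF0, 0xF5, 0xF8, 0xFC, 0xFE, 0xFF):
    print(f"{value:02X}: {classify_byte(value)}")
```

A lead-byte table like this catches only part of the malformed input in this file; overlong three-byte and four-byte forms, surrogates, and F4 90 and above still need checks on the second byte.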