Broken UTF-8
Any copyright to this file is dedicated to the Public Domain. https://creativecommons.org/publicdomain/zero/1.0/
Five-byte and six-byte sequences were defined in RFC 2297 but are no longer part of the UTF-8 definition.
Non-shortest forms for lowest single-byte (U+0000)
- Two-byte sequence (C0 80)
- ��
- Three-byte sequence (E0 80 80)
- ��
- Four-byte sequence (F0 80 80 80)
- ��
- Five-byte sequence (F8 80 80 80 80)
- �����
- Six-byte sequence (FC 80 80 80 80 80)
- ������
Non-shortest forms for highest single-byte (U+007F)
- Two-byte sequence (C1 BF)
- ��
- Three-byte sequence (E0 81 BF)
- ��
- Four-byte sequence (F0 80 81 BF)
- ��
- Five-byte sequence (F8 80 80 81 BF)
- �����
- Six-byte sequence (FC 80 80 80 81 BF)
- ������
Non-shortest forms for lowest two-byte (U+0080)
- Three-byte sequence (E0 82 80)
- ��
- Four-byte sequence (F0 80 82 80)
- ��
- Five-byte sequence (F8 80 80 82 80)
- �����
- Six-byte sequence (FC 80 80 80 82 80)
- ������
Non-shortest forms for highest two-byte (U+07FF)
- Three-byte sequence (E0 9F BF)
- ��
- Four-byte sequence (F0 80 9F BF)
- ��
- Five-byte sequence (F8 80 80 9F BF)
- �����
- Six-byte sequence (FC 80 80 80 9F BF)
- ������
Non-shortest forms for lowest three-byte (U+0800)
- Four-byte sequence (F0 80 A0 80)
- ��
- Five-byte sequence (F8 80 80 A0 80)
- �����
- Six-byte sequence (FC 80 80 80 A0 80)
- ������
Non-shortest forms for highest three-byte (U+FFFF)
- Four-byte sequence (F0 8F BF BF)
- ��
- Five-byte sequence (F8 80 8F BF BF)
- �����
- Six-byte sequence (FC 80 80 8F BF BF)
- ������
Non-shortest forms for lowest four-byte (U+10000)
- Five-byte sequence (F8 80 90 80 80)
- �����
- Six-byte sequence (FC 80 80 90 80 80)
- ������
Non-shortest forms for last Unicode (U+10FFFF)
- Five-byte sequence (F8 84 8F BF BF)
- �����
- Six-byte sequence (FC 80 84 8F BF BF)
- ������
Out of range
- One past Unicode (F4 90 80 80)
- ��
- Longest five-byte sequence (FB BF BF BF BF)
- �����
- Longest six-byte sequence (FD BF BF BF BF BF)
- ������
- First surrogate (ED A0 80)
- í €
- Last surrogate (ED BF BF)
- í¿¿
- CESU-8 surrogate pair (ED A0 BD ED B2 A9)
- 💩
Out of range and non-shortest
- One past Unicode as five-byte sequence (F8 84 90 80 80)
- �����
- One past Unicode as six-byte sequence (FC 80 84 90 80 80)
- ������
- First surrogate as four-byte sequence (F0 8D A0 80)
- ��
- Last surrogate as four-byte sequence (F0 8D BF BF)
- ��
- CESU-8 surrogate pair as two four-byte overlongs (F0 8D A0 BD F0 8D B2 A9)
- ����
Lone trails
- One (80)
- �
- Two (80 80)
- ��
- Three (80 80 80)
- ���
- Four (80 80 80 80)
- ����
- Five (80 80 80 80 80)
- �����
- Six (80 80 80 80 80 80)
- ������
- Seven (80 80 80 80 80 80 80)
- �������
- After valid two-byte (C2 B6 80)
- ¶�
- After valid three-byte (E2 98 83 80)
- ☃�
- After valid four-byte (F0 9F 92 A9 80)
- 💩�
- After five-byte (FB BF BF BF BF 80)
- ������
- After six-byte (FD BF BF BF BF BF 80)
- �������
Truncated sequences
- Two-byte lead (C2)
- �
- Three-byte lead (E2)
- �
- Three-byte lead and one trail (E2 98)
- �
- Four-byte lead (F0)
- �
- Four-byte lead and one trail (F0 9F)
- �
- Four-byte lead and two trails (F0 9F 92)
- �
Leftovers
- FE (FE)
- �
- FE and trail (FE 80)
- ��
- FF (FF)
- �
- FF and trail (FF 80)
- ��