Broken UTF-8

Any copyright to this file is dedicated to the Public Domain. https://creativecommons.org/publicdomain/zero/1.0/

Five-byte and six-byte sequences were defined in RFC 2279 but are no longer part of the UTF-8 definition (RFC 3629 limits UTF-8 to four bytes and to code points up to U+10FFFF).

Non-shortest forms for lowest single-byte (U+0000)

Two-byte sequence (C0 80)
��
Three-byte sequence (E0 80 80)
��
Four-byte sequence (F0 80 80 80)
���
Five-byte sequence (F8 80 80 80 80)
�����
Six-byte sequence (FC 80 80 80 80 80)
������

Non-shortest forms for highest single-byte (U+007F)

Two-byte sequence (C1 BF)
��
Three-byte sequence (E0 81 BF)
��
Four-byte sequence (F0 80 81 BF)
���
Five-byte sequence (F8 80 80 81 BF)
�����
Six-byte sequence (FC 80 80 80 81 BF)
������

Non-shortest forms for lowest two-byte (U+0080)

Three-byte sequence (E0 82 80)
��
Four-byte sequence (F0 80 82 80)
���
Five-byte sequence (F8 80 80 82 80)
�����
Six-byte sequence (FC 80 80 80 82 80)
������

Non-shortest forms for highest two-byte (U+07FF)

Three-byte sequence (E0 9F BF)
��
Four-byte sequence (F0 80 9F BF)
���
Five-byte sequence (F8 80 80 9F BF)
�����
Six-byte sequence (FC 80 80 80 9F BF)
������

Non-shortest forms for lowest three-byte (U+0800)

Four-byte sequence (F0 80 A0 80)
���
Five-byte sequence (F8 80 80 A0 80)
�����
Six-byte sequence (FC 80 80 80 A0 80)
������

Non-shortest forms for highest three-byte (U+FFFF)

Four-byte sequence (F0 8F BF BF)
���
Five-byte sequence (F8 80 8F BF BF)
�����
Six-byte sequence (FC 80 80 8F BF BF)
������

Non-shortest forms for lowest four-byte (U+10000)

Five-byte sequence (F8 80 90 80 80)
�����
Six-byte sequence (FC 80 80 90 80 80)
������

Non-shortest forms for last Unicode code point (U+10FFFF)

Five-byte sequence (F8 84 8F BF BF)
�����
Six-byte sequence (FC 80 84 8F BF BF)
������
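
The byte values in the non-shortest-form labels can be re-derived mechanically. What follows is a minimal Python sketch, not part of the test data: encode_n is a hypothetical helper name, and it deliberately skips the shortest-form check that a conforming encoder would enforce, so that it can produce overlong forms.

```python
def encode_n(cp: int, n: int) -> bytes:
    """Encode code point cp as an n-byte UTF-8-style sequence (2 <= n <= 6).

    Illustration only: no shortest-form check is made, so the result may be
    an overlong form that a conforming decoder has to reject.
    """
    if not 2 <= n <= 6:
        raise ValueError("n must be between 2 and 6")
    # The lead byte carries 7 - n payload bits, each trail byte carries 6.
    payload_bits = (7 - n) + 6 * (n - 1)
    if cp < 0 or cp >= (1 << payload_bits):
        raise ValueError("code point does not fit in n bytes")
    # Lead byte: n one-bits, a zero bit, then the top payload bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    out = [lead_marker | (cp >> (6 * (n - 1)))]
    # Trail bytes: 10xxxxxx, most significant group first.
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


def hex_label(data: bytes) -> str:
    """Format bytes the way the labels in this file do, e.g. 'C0 80'."""
    return " ".join(f"{b:02X}" for b in data)
```

For example, hex_label(encode_n(0x0000, 2)) gives "C0 80" and hex_label(encode_n(0x10FFFF, 5)) gives "F8 84 8F BF BF", matching the labels above. Python's built-in strict decoder raises UnicodeDecodeError on both.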

Out of range

One past Unicode (F4 90 80 80)
���
Longest five-byte sequence (FB BF BF BF BF)
�����
Longest six-byte sequence (FD BF BF BF BF BF)
������
First surrogate (ED A0 80)
��
Last surrogate (ED BF BF)
��
CESU-8 surrogate pair (ED A0 BD ED B2 A9)
����
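
The out-of-range cases above come down to two checks on the decoded code point: it must not exceed U+10FFFF and it must not be a surrogate. A hedged Python sketch of those checks follows, along with the arithmetic that turns the CESU-8 surrogate pair into the code point it stands for. The function names are illustrative assumptions, not taken from any library.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value, i.e. something UTF-8 may encode."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def strict_decode_ok(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

ED A0 BD and ED B2 A9 are the three-byte forms of U+D83D and U+DCA9. combine_surrogates(0xD83D, 0xDCA9) returns 0x1F4A9, whose valid UTF-8 encoding is F0 9F 92 A9. The CESU-8 form itself is not valid UTF-8, so strict_decode_ok returns False for it.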

Out of range and non-shortest

One past Unicode as five-byte sequence (F8 84 90 80 80)
�����
One past Unicode as six-byte sequence (FC 80 84 90 80 80)
������
First surrogate as four-byte sequence (F0 8D A0 80)
���
Last surrogate as four-byte sequence (F0 8D BF BF)
���
CESU-8 surrogate pair as two four-byte overlongs (F0 8D A0 BD F0 8D B2 A9)
������

Lone trails

One (80)
Two (80 80)
��
Three (80 80 80)
���
Four (80 80 80 80)
����
Five (80 80 80 80 80)
�����
Six (80 80 80 80 80 80)
������
Seven (80 80 80 80 80 80 80)
�������
After valid two-byte (C2 B6 80)
¶�
After valid three-byte (E2 98 83 80)
☃�
After valid four-byte (F0 9F 92 A9 80)
💩�
After five-byte (FB BF BF BF BF 80)
������
After six-byte (FD BF BF BF BF BF 80)
�������

Truncated sequences

Two-byte lead (C2)
Three-byte lead (E2)
Three-byte lead and one trail (E2 98)
Four-byte lead (F0)
Four-byte lead and one trail (F0 9F)
Four-byte lead and two trails (F0 9F 92)
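
A streaming decoder has to tell a truncated sequence (more input may still arrive) from an invalid one. The Python sketch below is one possible way to compute how many trail bytes are still missing, assuming RFC 3629 lead-byte ranges. Both function names are hypothetical, and the trail bytes already present are not validated.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte under RFC 3629.

    Returns 0 for bytes that cannot start a sequence (trail bytes,
    C0, C1, and F5 through FF).
    """
    if 0x00 <= lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def missing_trails(data: bytes) -> int:
    """Number of trail bytes still needed to complete the sequence
    starting at data[0]. Existing trail bytes are not checked."""
    if not data:
        raise ValueError("empty input")
    n = expected_length(data[0])
    if n == 0:
        raise ValueError("first byte is not a valid lead byte")
    return max(0, n - len(data))
```

Applied to the cases above: C2 is missing 1 trail, E2 is missing 2, E2 98 is missing 1, F0 is missing 3, F0 9F is missing 2, and F0 9F 92 is missing 1.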

Leftovers

FE (FE)
FE and trail (FE 80)
��
FF (FF)
FF and trail (FF 80)
��
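
FE and FF are left over because no version of UTF-8 ever assigned them a role, whereas F8 through FD were lead bytes only in the older five-byte and six-byte forms. As a rough summary of the byte categories exercised in this file, here is a small Python sketch. classify_byte and its category strings are assumptions made for illustration.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte value by the role it can play in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"
    if b <= 0xBF:
        return "trail"
    if b <= 0xC1:
        return "invalid: two-byte lead that can only start an overlong form"
    if b <= 0xDF:
        return "lead-2"
    if b <= 0xEF:
        return "lead-3"
    if b <= 0xF4:
        return "lead-4"
    if b <= 0xF7:
        return "invalid: four-byte lead beyond U+10FFFF"
    if b <= 0xFB:
        return "invalid: legacy five-byte lead"
    if b <= 0xFD:
        return "invalid: legacy six-byte lead"
    return "invalid: never used in UTF-8"
```

classify_byte(0xFE) and classify_byte(0xFF) both return "invalid: never used in UTF-8", and classify_byte(0x80) returns "trail". A byte classified as a valid lead can still begin an invalid sequence, for example F4 followed by 90.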