Broken UTF-8

Any copyright to this file is dedicated to the Public Domain. https://creativecommons.org/publicdomain/zero/1.0/

Five-byte and six-byte sequences were defined in RFC 2279 but are no longer part of the UTF-8 definition: RFC 3629 limits UTF-8 to sequences of at most four bytes.

Non-shortest forms for lowest single-byte (U+0000)

Two-byte sequence (C0 80)

Three-byte sequence (E0 80 80)
à€€
Four-byte sequence (F0 80 80 80)
đ€€€
Five-byte sequence (F8 80 80 80 80)
ű€€€€
Six-byte sequence (FC 80 80 80 80 80)
ü€€€€€

Non-shortest forms for highest single-byte (U+007F)

Two-byte sequence (C1 BF)
Áż
Three-byte sequence (E0 81 BF)
àż
Four-byte sequence (F0 80 81 BF)
đ€ż
Five-byte sequence (F8 80 80 81 BF)
ű€€ż
Six-byte sequence (FC 80 80 80 81 BF)
ü€€€ż

Non-shortest forms for lowest two-byte (U+0080)

Three-byte sequence (E0 82 80)
à‚€
Four-byte sequence (F0 80 82 80)
đ€‚€
Five-byte sequence (F8 80 80 82 80)
ű€€‚€
Six-byte sequence (FC 80 80 80 82 80)
ü€€€‚€

Non-shortest forms for highest two-byte (U+07FF)

Three-byte sequence (E0 9F BF)
àŸż
Four-byte sequence (F0 80 9F BF)
đ€Ÿż
Five-byte sequence (F8 80 80 9F BF)
ű€€Ÿż
Six-byte sequence (FC 80 80 80 9F BF)
ü€€€Ÿż

Non-shortest forms for lowest three-byte (U+0800)

Four-byte sequence (F0 80 A0 80)
đ€ €
Five-byte sequence (F8 80 80 A0 80)
ű€€ €
Six-byte sequence (FC 80 80 80 A0 80)
ü€€€ €

Non-shortest forms for highest three-byte (U+FFFF)

Four-byte sequence (F0 8F BF BF)
đżż
Five-byte sequence (F8 80 8F BF BF)
ű€żż
Six-byte sequence (FC 80 80 8F BF BF)
ü€€żż

Non-shortest forms for lowest four-byte (U+10000)

Five-byte sequence (F8 80 90 80 80)
ű€€€
Six-byte sequence (FC 80 80 90 80 80)
ü€€€€

Non-shortest forms for last Unicode (U+10FFFF)

Five-byte sequence (F8 84 8F BF BF)
ű„żż
Six-byte sequence (FC 80 84 8F BF BF)
ü€„żż
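
The non-shortest forms above can be recognized by decoding the sequence with the old RFC 2279 bit layout and comparing its length with the shortest length that can hold the resulting code point. The following is a minimal sketch under that assumption, not a complete validator; the function names are hypothetical and not part of this file.

```python
# Hypothetical helper: decode ONE sequence using the RFC 2279 layout
# (1 to 6 bytes) and report whether it is a non-shortest (overlong) form.

# (mask applied to lead byte, expected pattern, payload mask, total length)
_LAYOUTS = [
    (0x80, 0x00, 0x7F, 1),
    (0xE0, 0xC0, 0x1F, 2),
    (0xF0, 0xE0, 0x0F, 3),
    (0xF8, 0xF0, 0x07, 4),
    (0xFC, 0xF8, 0x03, 5),
    (0xFE, 0xFC, 0x01, 6),
]


def minimal_length(code_point: int) -> int:
    """Shortest RFC 2279 sequence length able to hold code_point."""
    for limit, length in ((0x80, 1), (0x800, 2), (0x10000, 3),
                          (0x200000, 4), (0x4000000, 5)):
        if code_point < limit:
            return length
    return 6


def decode_legacy_sequence(data: bytes) -> tuple[int, bool]:
    """Return (code_point, is_overlong) for exactly one sequence."""
    lead = data[0]
    for test_mask, pattern, payload_mask, length in _LAYOUTS:
        if (lead & test_mask) == pattern:
            break
    else:
        # Covers lone trail bytes (80..BF) as well as FE and FF.
        raise ValueError("not a lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead")
    code_point = lead & payload_mask
    for trail in data[1:]:
        if (trail & 0xC0) != 0x80:
            raise ValueError("expected a trail byte")
        code_point = (code_point << 6) | (trail & 0x3F)
    return code_point, length > minimal_length(code_point)


# Six-byte non-shortest form of U+10FFFF from the list above.
print(decode_legacy_sequence(bytes.fromhex("FC 80 84 8F BF BF")))
```

A conforming decoder has to reject every sequence for which the second value is true, even though the bit pattern decodes to a legitimate code point.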

Out of range

One past Unicode (F4 90 80 80)
ô€€
Longest five-byte sequence (FB BF BF BF BF)
ûżżżż
Longest six-byte sequence (FD BF BF BF BF BF)
ężżżżż
First surrogate (ED A0 80)
í €
Last surrogate (ED BF BF)
íżż
CESU-8 surrogate pair (ED A0 BD ED B2 A9)
í œíČ©
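
As a rough illustration of why the sequences above are out of range: a strict decoder accepts only Unicode scalar values, that is, code points up to U+10FFFF excluding the surrogates U+D800 to U+DFFF. The sketch below uses Python's built-in codec; the helper names are hypothetical, and "surrogatepass" is a Python-specific error handler used here only to expose the surrogates that the CESU-8 bytes carry.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """True for code points that well-formed UTF-8 may encode."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


def strict_decode_fails(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder rejects data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it represents."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cesu8_pair = bytes.fromhex("ED A0 BD ED B2 A9")

# Strict decoding rejects the pair, as it does each lone surrogate.
print(strict_decode_fails(cesu8_pair))

# With surrogatepass, the two surrogates become visible as code points.
high, low = (ord(ch) for ch in cesu8_pair.decode("utf-8", "surrogatepass"))
print(hex(combine_surrogates(high, low)))
```

The pair stands for U+1F4A9, whose well-formed UTF-8 encoding is the four-byte sequence F0 9F 92 A9.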

Out of range and non-shortest

One past Unicode as five-byte sequence (F8 84 90 80 80)
ű„€€
One past Unicode as six-byte sequence (FC 80 84 90 80 80)
ü€„€€
First surrogate as four-byte sequence (F0 8D A0 80)
đ €
Last surrogate as four-byte sequence (F0 8D BF BF)
đżż
CESU-8 surrogate pair as two four-byte overlongs (F0 8D A0 BD F0 8D B2 A9)
đ œđČ©

Lone trails

One (80)
€
Two (80 80)
€€
Three (80 80 80)
€€€
Four (80 80 80 80)
€€€€
Five (80 80 80 80 80)
€€€€€
Six (80 80 80 80 80 80)
€€€€€€
Seven (80 80 80 80 80 80 80)
€€€€€€€
After valid two-byte (C2 B6 80)
¶€
After valid three-byte (E2 98 83 80)
☃€
After valid four-byte (F0 9F 92 A9 80)
💩€
After five-byte (FB BF BF BF BF 80)
ûżżżż€
After six-byte (FD BF BF BF BF BF 80)
ężżżżż€

Truncated sequences

Two-byte lead (C2)
Â
Three-byte lead (E2)
â
Three-byte lead and one trail (E2 98)
â˜
Four-byte lead (F0)
đ
Four-byte lead and one trail (F0 9F)
đŸ
Four-byte lead and two trails (F0 9F 92)
đŸ’
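
A truncated sequence differs from the other cases in that it is a valid prefix: it only becomes an error once the input ends. The sketch below shows one way to observe that distinction with Python's incremental UTF-8 decoder from the standard `codecs` module; the `classify` helper is hypothetical and not part of this file.

```python
import codecs


def classify(data: bytes) -> str:
    """Classify data as 'complete', 'incomplete', or 'invalid' UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: a partial sequence at the end is buffered, not an error.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "invalid"
    pending, _ = decoder.getstate()  # bytes still waiting for their trails
    return "incomplete" if pending else "complete"


for sample in ("C2", "E2 98", "F0 9F 92", "E2 98 83", "80"):
    print(sample, classify(bytes.fromhex(sample)))
```

At end of input (`final=True`, or a plain `bytes.decode`) each incomplete case is an error.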

Leftovers

FE (FE)
ț
FE and trail (FE 80)
ț€
FF (FF)
ÿ
FF and trail (FF 80)
ÿ€