It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

Henri Sivonen, 2019-04-24

Disclosure: I work for Mozilla, and my professional activity includes being the Gecko module owner for character encodings.

Disclaimer: Even though this document links to code and documents written as part of my Mozilla activities, this document is written in a personal capacity.

Summary

Text processing facilities in the C++ standard library have been mostly agnostic of the actual character encoding of text. The few operations that are sensitive to the actual character encoding are defined to behave according to the implementation-defined “narrow execution encoding” (for buffers of char) and the implementation-defined “wide execution encoding” (for buffers of wchar_t).

Meanwhile, over the last two decades, a different dominant design has arisen for text processing in other programming languages as well as in C and C++ usage despite what the C and C++ standard-library facilities provide: Representing text as Unicode, and only Unicode, internally in the application even if some other representation is required externally for backward compatibility.

I think the C++ standard should adopt the approach of “Unicode-only internally” for new text processing facilities and should not support non-Unicode execution encodings in newly-introduced features. This allows new features to have less abstraction obfuscation for Unicode usage, avoids digging legacy applications deeper into non-Unicode commitment, and avoids the specification and implementation effort of adapting new features to make sense for non-Unicode execution encodings.

Concretely, I suggest:

Context

This write-up is in response to (and in disagreement with) the “Character Types” section in the P0244R2 Text_view paper:

This library defines a character class template parameterized by character set type used to represent character values. The purpose of this class template is to make explicit the association of a code point value and a character set.

It has been suggested that char32_t be supported as a character type that is implicitly associated with the Unicode character set and that values of this type always be interpreted as Unicode code point values. This suggestion is intended to enable UTF-32 string literals to be directly usable as sequences of character values (in addition to being sequences of code unit and code point values). This has a cost in that it prohibits use of the char32_t type as a code unit or code point type for other encodings. Non-Unicode encodings, including the encodings used for ordinary and wide string literals, would still require a distinct character type (such as a specialization of the character class template) so that the correct character set can be inferred from objects of the character type.

This suggestion raises concerns for the author. To a certain degree, it can be accommodated by removing the current members of the character class template in favor of free functions and type trait templates. However, it results in ambiguities when enumerating the elements of a UTF-32 string literal; are the elements code point or character values? Well, the answer would be both (and code unit values as well). This raises the potential for inadvertently writing (generic) code that confuses code points and characters, runs as expected for UTF-32 encodings, but fails to compile for other encodings. The author would prefer to enforce correct code via the type system and is unaware of any particular benefits that the ability to treat UTF-32 string literals as sequences of character type would bring.

It has also been suggested that char32_t might suffice as the only character type; that decoding of any encoded string include implicit transcoding to Unicode code points. The author believes that this suggestion is not feasible for several reasons:

  1. Some encodings use character sets that define characters such that round trip transcoding to Unicode and back fails to preserve the original code point value. For example, Shift-JIS (Microsoft code page 932) defines duplicate code points for the same character for compatibility with IBM and NEC character set extensions.
    https://support.microsoft.com/en-us/kb/170559 [sic; dead link]
  2. Transcoding to Unicode for all non-Unicode encodings would carry non-negligible performance costs and would pessimize platforms such as IBM’s z/OS that use EBCIDC by default for the non-Unicode execution character sets.

To summarize, it raises three concerns:

  1. Ambiguity between code units and scalar values (the paper says “code points”, but I say “scalar values” to emphasize the exclusion of surrogates) in the UTF-32 case.
  2. Some encodings, particularly Microsoft code page 932, can represent one Unicode scalar value in more than one way, so the distinction of which way does not round-trip.
  3. Transcoding non-Unicode execution encodings has a performance cost that pessimizes particularly IBM z/OS.

Terminology and Background

(This section and the next section should not be taken as ’splaining to SG16 what they already know. The over-explaining is meant to make this document more coherent for a broader audience of readers who might be interested in C++ standardization without full familiarity with text processing terminology or background, or the details of Microsoft code page 932.)

An abstract character is an atomic unit of text. Depending on writing system, the analysis of what constitutes an atomic unit may differ, but a given implementation on a computer has to identify some things as atomic units. Unicode’s opinion of what is an abstract character is the most widely applied opinion. In fact, Unicode itself has multiple opinions on this, and Unicode Normalization Forms bridge these multiple opinions.

A character set is a set of abstract characters. In principle, a set of characters can be defined without assigning numbers to them.

A coded character set assigns numbers, called code points, to the abstract characters in the character set.

When the Unicode code space was extended beyond the Basic Multilingual Plane, some code points were set aside for the UTF-16 surrogate mechanism and, therefore, do not represent abstract characters. A Unicode scalar value is a Unicode code point that is not a surrogate code point. For consistency with Unicode, I use the term scalar value below when referring to non-Unicode coded character sets, too.

A character encoding is a way to represent a conceptual sequence of scalar values from one or more coded character sets as a concrete sequence of bytes. The bytes are called code units. Unicode defines in-memory Unicode encoding forms whose code unit is not a byte: UTF-16 and UTF-32. (For these Unicode encoding forms, there are corresponding Unicode encoding schemes that use byte code units and represent a non-byte code unit from a corresponding encoding form as multiple bytes and, therefore, could be used in byte-oriented IO even though UTF-8 is preferred for interchange. UTF-8, of course, uses byte code units as both a Unicode encoding form and as a Unicode encoding scheme.)
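
As a concrete (if trivial) illustration of the encoding form vs. encoding scheme distinction, consider U+20AC EURO SIGN; the array names below are just for illustration:

#include <cstdint>

// U+20AC EURO SIGN as code units of the three Unicode encoding forms and
// as bytes of the UTF-16BE encoding scheme.
const char8_t  utf8_form[]  = {0xE2, 0x82, 0xAC}; // three byte-sized code units
const char16_t utf16_form[] = {0x20AC};           // one 16-bit code unit
const char32_t utf32_form[] = {U'\x20AC'};        // one 32-bit code unit
const unsigned char utf16be_scheme[] = {0x20, 0xAC}; // encoding scheme: plain bytes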

Coded character sets that assign scalar values in the range 0...255 (decimal) can be considered to trivially imply a character encoding for themselves: You just store the scalar value as an unsigned byte value. (Often such coded character sets import US-ASCII as the lower half.)

However, it is possible to define less obvious encodings even for character sets that only have up to 256 characters. IBM has several EBCDIC character encodings for the set of characters defined in ISO-8859-1. That is, compared to the trivial ISO-8859-1 encoding (the original, not the Web alias for windows-1252), these EBCDIC encodings permute the byte value assignments.

Unicode is the universal coded character set that by design includes abstract characters from all notable legacy coded character sets such that character encodings for legacy coded character sets can be redefined to represent Unicode scalar values. Consider representing ż in the ISO-8859-2 encoding. When we treat the ISO-8859-2 encoding as an encoding for the Unicode coded character set (as opposed to treating it as an encoding for the ISO-8859-2 coded character set), byte 0xBF decodes to Unicode scalar value U+017C (and not as scalar value 0xBF).
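
As a minimal sketch of what treating ISO-8859-2 as an encoding for the Unicode coded character set means in code (the function name is made up, and only the cases discussed above are shown; a real decoder would use a full lookup table for the upper half):

#include <cstdint>

// Hypothetical single-byte decoder: ISO-8859-2 byte to Unicode scalar value.
char32_t decode_iso_8859_2_byte(unsigned char b) {
    if (b < 0xA0) {
        return b;         // here the byte value and the scalar value coincide
    }
    if (b == 0xBF) {
        return U'\x017C'; // ż: the byte value (0xBF) and the scalar value differ
    }
    // ... the rest of the upper half needs a lookup table ...
    return U'\xFFFD';     // placeholder in this sketch
}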

A compatibility character is a character that according to Unicode principles should not be a distinct abstract character but that Unicode nonetheless codes as a distinct abstract character because some legacy coded character set treated it as distinct.

The Microsoft Code Page 932 Issue

Usually in C++ a “character type” refers to a code unit type, but the Text_view paper uses the term “character type” to refer to a Unicode scalar value when the encoding is a Unicode encoding form. The paper implies that an analogous non-Unicode type exists for Microsoft code page 932 (Microsoft’s version of Shift_JIS), but does one really exist?

Microsoft code page 932 takes the 8-bit encoding of the JIS X 0201 coded character set, whose upper half is half-width katakana and lower half is ASCII-based, and replaces the lower half with actual US-ASCII (moving the difference between US-ASCII and the lower half of 8-bit-encoded JIS X 0201 into a font problem!). It then takes the JIS X 0208 coded character set and represents it with two-byte sequences (for the lead byte making use of the unassigned range of JIS X 0201). JIS X 0208 code points aren’t really one-dimensional scalars, but instead two-dimensional row and column numbers in a 94 by 94 grid. (See the first 94 rows of the visualization supplied with the Encoding Standard; avoid opening the link on a RAM-limited device!) Shift_JIS / Microsoft code page 932 does not put these two numbers into bytes directly, but conceptually arranges each two rows of 94 columns into one row of 188 columns and then transforms these new row and column numbers into bytes with some offsetting.
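
A minimal sketch of that offsetting, following the formulation used by the WHATWG Encoding Standard (row and column are zero-based positions in the rearranged 188-column grid; the function name is made up):

#include <cstdint>
#include <utility>

// Transform a zero-based (row, column) position in the 188-column grid into
// a Shift_JIS byte pair.
std::pair<unsigned char, unsigned char> shift_jis_bytes(unsigned row, unsigned column) {
    unsigned lead_offset  = (row < 0x1F) ? 0x81 : 0xC1;    // lead bytes skip 0xA0..0xDF (half-width katakana)
    unsigned trail_offset = (column < 0x3F) ? 0x40 : 0x41; // trail bytes skip 0x7F
    return {static_cast<unsigned char>(row + lead_offset),
            static_cast<unsigned char>(column + trail_offset)};
}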

While the JIS X 0208 grid is rearranged into 47 rows of a 188-column grid, the full 188-column grid has 60 rows. The last 13 rows are used for IBM extensions and for private use. The private use area maps to the (start of the) Unicode Private Use Area. (See a visualization of the rearranged grid with the private use part showing up as unassigned; again avoid opening the link on a RAM-limited device.)

The extension part is where the concern that the Text_view paper seeks to address comes in. NEC and IBM came up with some characters that they felt JIS X 0208 needed to be extended with. NEC’s own extensions go onto row 13 (in one-based numbering) of the 94 by 94 JIS X 0208 grid (unallocated in JIS X 0208 proper), so that extension can safely be treated as if it had always been part of JIS X 0208 itself. The IBM extension, however, goes onto the last 3 rows of the 60-row Shift_JIS grid, i.e. outside the space that the JIS X 0208 94 by 94 grid maps to. However, US-ASCII, the half-width katakana part of JIS X 0201, and JIS X 0208 are also encoded, in a different way, by EUC-JP. EUC-JP can only encode the 94 by 94 grid of JIS X 0208. To make the IBM extensions fit into the 94 by 94 grid, NEC relocated the IBM extensions within the 94 by 94 grid in space that the JIS X 0208 standard left unallocated.

IBM Shift_JIS and NEC EUC-JP (without the later JIS X 0213 extension) both encode the same set of characters, but in different ways. Furthermore, both can round-trip via Unicode. Unicode principles analyze some of the IBM extension kanji as duplicates of kanji that were already in the original JIS X 0208. However, to enable round-tripping (which was thought worthwhile to achieve at the time), Unicode treats the IBM duplicates as compatibility characters. (Round-tripping is lost, of course, if the text decoded into Unicode is normalized such that compatibility characters are replaced with their canonical equivalents before re-encoding.)

This brings us to the issue that the Text_view paper treats as significant: Since Shift_JIS can represent the whole 94 by 94 JIS X 0208 grid and NEC put the IBM extension there, a naïve conversion from EUC-JP to Shift_JIS can fail to relocate the IBM extension characters to the end of the Shift_JIS code space and can put them in the position where they land if the 94 by 94 grid is simply transformed as the first 47 rows of the 188-column-wide Shift_JIS grid. When decoding to Unicode, Microsoft code page 932 supports both locations for the IBM extensions, but when encoding from Unicode, it has to pick one way of doing things, and it picks the end of the Shift_JIS code space.

That is, Unicode does not assign another set of compatibility characters to Microsoft code page 932’s duplication of the IBM extensions, so despite NEC EUC-JP and IBM Shift_JIS being round-trippable via Unicode, Microsoft code page 932, i.e. Microsoft Shift_JIS, is not. This makes sense considering that there is no analysis that claims the IBM and NEC instances of the IBM extensions to be semantically different: their provenance clearly indicates that the duplication isn’t an attempt to make a distinction in meaning. The Text_view paper takes the position that C++ should round-trip the NEC instance of the IBM extensions in Microsoft code page 932 as distinct from the IBM instance of the IBM extensions even though Microsoft’s own implementation does not. In fact, the whole point of the Text_view paper mentioning Microsoft code page 932 is to give an example of a legacy encoding that doesn’t round-trip via Unicode, despite Unicode generally having been designed to round-trip legacy encodings, and to opine that it ought to round-trip in C++.

So:

Inferring a Coded Character Set from an Encoding

(This section is based on the constraints imposed by Text_view paper instead of being based on what the reference implementation does for Microsoft code page 932. From code inspection, it appears that support for multi-byte narrow execution encodings is unimplemented, and when trying to verify this experimentally, I timed out trying to get it running due to an internal compiler error when trying to build with a newer GCC and a GCC compilation error when trying to build the known-good GCC revision.)

While the standards don’t provide a scalar value definition for Microsoft code page 932, it’s easy to make one up based on tradition: Traditionally, the two-byte characters in CJK legacy encodings have been referred to by interpreting the two bytes as a 16-bit big-endian unsigned number presented as hexadecimal (and single-byte characters as an 8-bit unsigned number).
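
As a sketch, this traditional numbering amounts to nothing more than the following (for the two-byte case; the function name is made up):

#include <cstdint>

// "Scalar" for a two-byte Microsoft code page 932 character under the
// traditional numbering: the two bytes read as a big-endian 16-bit integer.
std::uint16_t traditional_scalar(unsigned char lead, unsigned char trail) {
    return static_cast<std::uint16_t>((lead << 8) | trail);
}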

As an example, let’s consider 猪 (which Wiktionary translates as wild boar). Its canonical Unicode scalar value is U+732A. That’s what the JIS X 0208 instance decodes to when decoding Microsoft code page 932 into Unicode. The compatibility character for the IBM kanji purpose is U+FA16. That’s what both the IBM instance of the IBM extension and the NEC instance of the IBM extension decode to when decoding Microsoft code page 932 into Unicode. (For reasons unknown to me, Unicode couples U+FA16 with the IBM kanji compatibility purpose and assigns another compatibility character, U+FAA0, for compatibility with the North Korean KPS 10721-2000 standard, which is irrelevant to Microsoft code page 932. Note that not all IBM kanji have corresponding DPRK compatibility characters, so we couldn’t repurpose the DPRK compatibility characters for distinguishing the IBM and NEC instances of the IBM extensions even if we wanted to.)

When interpreting the Microsoft code page 932 bytes as a big-endian integer, the JIS X 0208 instance of 猪 would be 0x9296, the IBM instance would be 0xFB5E, and the NEC instance would be 0xEE42. To highlight how these “scalars” are coupled with the encoding instead of the standard character sets that the encodings originally encode, in EUC-JP the JIS X 0208 instance would be 0xC3F6 and the NEC instance would be 0xFBA3. Also, for illustration, if the same rule were applied to UTF-8, the scalar would be 0xE78CAA instead of U+732A. Clearly, we don’t want the scalars to be different between UTF-8, UTF-16, and UTF-32, so it is at least theoretically unsatisfactory for Microsoft code page 932 and EUC-JP to get different scalars for what are clearly the same characters in the underlying character sets.

It would be possible to do something else that’d give the same scalar values for Shift_JIS and EUC-JP without a lookup table. We could number the characters on the two-dimensional grid starting with 256 for the top left cell to reserve the scalars 0…255 for the JIS X 0201 part. It’s worth noting, though, that this approach wouldn’t work well for Korean and Simplified Chinese encodings that take inspiration from the 94 by 94 structure of JIS X 0208. KS X 1001 and GB2312 also define a 94 by 94 grid like JIS X 0208. However, while Microsoft code page 932 extends the grid down, so a consecutive numbering would just add greater numbers to the end, Microsoft code pages 949 and 936 extend the KS X 1001 and GB2312 grids above and to the left, which means that a consecutive numbering of the extended grid would be totally different from the consecutive numbering of the unextended grid. On the other hand, interpreting each byte pair as a big-endian 16-bit integer would yield the same values in the extended and unextended Korean and Simplified Chinese cases. (See visualizations for 949 and 936; again avoid opening on a RAM-limited device. Search for “U+3000” to locate the top left corner of the original 94 by 94 grid.)

What About EBCDIC?

Text_view wants to avoid transcoding overhead on z/OS, but z/OS has multiple character encodings for the ISO-8859-1 character set. It seems conceptually bogus for all these to have different scalar values for the same character set. However, for all of them to have the same scalar values, a lookup table-based permutation would be needed. If that table permuted to the ISO-8859-1 order, it would be the same as the Unicode order, at which point the scalar values might as well be Unicode scalar values, which Text_view wanted to avoid on z/OS citing performance concerns. (Of course, z/OS also has EBCDIC encodings whose character set is not ISO-8859-1.)
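
A sketch of the shape of such a lookup table-based permutation (the table contents are deliberately omitted; the point is only that one 256-entry table per EBCDIC variant would be needed):

// One permutation table per EBCDIC variant, mapping each byte to the scalar
// value of the character it encodes. (Declared only; contents omitted.)
extern const char32_t ebcdic_variant_to_scalar[256];

char32_t decode_ebcdic_byte(unsigned char b) {
    return ebcdic_variant_to_scalar[b];
}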

What About GB18030?

The whole point of GB18030 is that it encodes Unicode scalar values in a way that makes the encoding byte-compatible with GBK (Microsoft code page 936) and GB2312. This operation is inherently lookup table-dependent. Inventing a scalar definition for GB18030 that achieved the Text_view goal of avoiding lookup tables would break the design goal of GB18030 that it encodes all Unicode scalar values. (In the Web Platform, due to legacy reasons, GB18030 encodes all but one scalar value and represents one scalar value twice.)

What’s Wrong with This?

Let’s evaluate the above in the light of P1238R0, the SG16: Unicode Direction paper.

The reason why Text_view tries to fit Unicode-motivated operations onto legacy encodings is that, as noted by “1.1 Constraint: The ordinary and wide execution encodings are implementation defined”, non-UTF execution encodings exist. This is, obviously, true. However, I disagree with the conclusion of making new features apply to these pre-existing execution encodings. I think there is no obligation to adapt new features to make sense for non-UTF execution encodings. It should be sufficient to keep existing legacy code running, i.e. not removing existing features should be sufficient. On the topic of wchar_t, the Unicode Direction paper says “1.4. Constraint: wchar_t is a portability deadend”. I think char with a non-UTF-8 execution encoding should also be declared a deadend, whereas the Unicode Direction paper merely notes “1.3. Constraint: There is no portable primary execution encoding”. Making new features work with a deadend foundation lures applications deeper into deadends, which is bad.

While inferring scalar values for an encoding by interpreting the encoded bytes for each character as a big-endian integer (thereby effectively inferring a, potentially non-standard, coded character set from an encoding) might be argued to be traditional enough to fit “2.1. Guideline: Avoid excessive inventiveness; look for existing practice”, it is a bad fit for “1.6. Constraint: Implementors cannot afford to rewrite ICU”. If implementors cannot be expected to have the bandwidth to implement text processing features from scratch and, therefore, should be prepared to delegate to ICU, it makes no sense to make implementations or the C++ standard come up with non-Unicode numberings for abstract characters, since such numberings aren’t supported by ICU and would necessarily require writing new code for anachronistic non-Unicode schemes.

Aside: Maybe analyzing the approach of using byte sequences interpreted as big-endian numbers looks like attacking a straw man and there could be some other non-Unicode numbering instead, such as the consecutive numbering outlined above. Any alternative non-Unicode numbering would still fail “1.6. Constraint: Implementors cannot afford to rewrite ICU” and would also fail “2.1. Guideline: Avoid excessive inventiveness; look for existing practice”.

Furthermore, I think the Text_view paper’s aspiration of distinguishing between the IBM and NEC instances of the IBM extensions in Microsoft code page 932 fails “2.1. Guideline: Avoid excessive inventiveness; look for existing practice”, because it effectively amounts to inventing additional compatibility characters that aren’t recognized as distinct by Unicode or the originator of the code page (Microsoft).

Moreover, iterating over a buffer of text by scalar value is a relatively simple operation when considering the range of operations that make sense to offer for Unicode text but that may not obviously fit non-UTF execution encodings. For example, in the light of “4.2. Directive: Standardize generic interfaces for Unicode algorithms” it would be reasonable and expected to provide operations for performing Unicode Normalization on strings. What does it mean to normalize a string to Unicode Normalization Form D under the ISO-8859-1 execution encoding? What does it mean to apply any Unicode Normalization Form under the windows-1258 execution encoding, which represents Vietnamese in a way that doesn’t match any Unicode Normalization Form? If the answer just is to make these no-ops for non-UTF encodings, would that be the right answer for GB18030? Coming up with answers other than just saying that new text processing operations shouldn’t try to fit non-UTF encodings at all would very quickly violate the guideline to “Avoid excessive inventiveness”.

Looking at other programming languages in the light of “2.1. Guideline: Avoid excessive inventiveness; look for existing practice” provides the way forward. Notable other languages have settled on not supporting coded character sets other than Unicode. That is, only the Unicode way of assigning scalar values to abstract characters is supported. Interoperability with legacy character encodings is achieved by decoding into Unicode upon input and, if non-UTF-8 output is truly required for interoperability, by encoding into legacy encoding upon output. The Unicode Direction paper already acknowledges this dominant design in “4.4. Directive: Improve support for transcoding at program boundaries”. I think C++ should consider the boundary between non-UTF-8 char and non-UTF-16/32 wchar_t on one hand and Unicode (preferably represented as UTF-8) on the other hand as a similar transcoding boundary between legacy code and new code such that new text processing features (other than the encoding conversion feature itself!) are provided on the char8_t/char16_t/char32_t side but not on the non-UTF execution encoding side. That is, while the Text_view paper says “Transcoding to Unicode for all non-Unicode encodings would carry non-negligible performance costs and would pessimize platforms such as IBM’s z/OS that use EBCIDC [sic] by default for the non-Unicode execution character sets.”, I think it’s more appropriate to impose such a cost at the boundary of legacy and future parts of z/OS programs than to contaminate all new text processing APIs with the question “What does this operation even mean for non-UTF encodings generally and EBCDIC encodings specifically?”. (In the case of Windows, the system already works in UTF-16 internally, so all narrow execution encodings already involve transcoding at the system interface boundary. In that context, it seems inappropriate to pretend that the legacy narrow execution encodings on Windows were somehow free of transcoding cost to begin with.)
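
As a sketch of what such a boundary could look like in code (the function names are hypothetical, and the stub conversion below only passes ASCII through so that the sketch is self-contained; real code would invoke an actual converter for the narrow execution encoding):

#include <string>
#include <string_view>

// Hypothetical boundary helper: legacy narrow-execution-encoding text in,
// UTF-8 out. Stubbed to ASCII-only for illustration.
std::u8string from_legacy(std::string_view narrow) {
    std::u8string out;
    for (unsigned char b : narrow) {
        if (b < 0x80) {
            out.push_back(static_cast<char8_t>(b));
        } else {
            out.append(u8"\uFFFD"); // a real converter would decode the legacy encoding here
        }
    }
    return out;
}

void new_text_processing(std::u8string_view text); // new facilities: Unicode only

void legacy_caller(std::string_view legacy_text) {
    // The conversion cost is paid once, at the boundary between old and new code.
    new_text_processing(from_legacy(legacy_text));
}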

To avoid a distraction from my main point, I’m explicitly not opining in this document on whether new text processing features should be available for sequences of char when the narrow execution encoding is UTF-8, for sequences of wchar_t when sizeof(wchar_t) is 2 and the wide execution encoding is UTF-16, or for sequences of wchar_t when sizeof(wchar_t) is 4 and the wide execution encoding is UTF-32.

The Type for a Unicode Scalar Value Should Be char32_t

The conclusion of the previous section is that new C++ facilities should not support number assignments to abstract characters other than Unicode, i.e. should not support coded character sets (either standardized or inferred from an encoding) other than Unicode. The conclusion makes it unnecessary to abstract type-wise over Unicode scalar values and some other kinds of scalar values. It just leaves the question of what the concrete type for a Unicode scalar value should be.

The Text_view paper says:

“It has been suggested that char32_t be supported as a character type that is implicitly associated with the Unicode character set and that values of this type always be interpreted as Unicode code point values. This suggestion is intended to enable UTF-32 string literals to be directly usable as sequences of character values (in addition to being sequences of code unit and code point values). This has a cost in that it prohibits use of the char32_t type as a code unit or code point type for other encodings.”

I disagree with this and am firmly in the camp that char32_t should be the type for a Unicode scalar value.

The sentence “This has a cost in that it prohibits use of the char32_t type as a code unit or code point type for other encodings.” is particularly alarming. Seeking to use char32_t as a code unit type for encodings other than UTF-32 would dilute the meaning of char32_t into another wchar_t mess. (I’m happy to see that P1041R4 “Make char16_t/char32_t string literals be UTF-16/32” was voted into C++20.)

As for the appropriateness of using the same type both for a UTF-32 code unit and a Unicode scalar value, the whole point of UTF-32 is that its code unit value is directly the Unicode scalar value. That is what UTF-32 is all about, and UTF-32 has nothing else to offer: The value space that UTF-32 can represent is more compactly represented by UTF-8 and UTF-16, both of which are more commonly needed for interoperation with existing interfaces. When having the code units be directly the scalar values is UTF-32’s whole point, it would be unhelpful to distinguish type-wise between UTF-32 code units and Unicode scalar values. (Also, considering that buffers of UTF-32 are rarely useful but iterators yielding Unicode scalar values make sense, it would be sad to make the iterators have a complicated type.)

To provide interfaces that are generic across std::u8string_view, std::u16string_view, and std::u32string_view (and, thereby, strings for which these views can be taken), all of these should have a way to obtain a scalar value iterator that yields char32_t values. To make sure such iterators really yield only Unicode scalar values in an interoperable way, the iterator should yield U+FFFD upon error. What constitutes a single error in UTF-8 is defined in the WHATWG Encoding Standard (which matches the “best practice” from the Unicode Standard). In UTF-16, each unpaired surrogate is an error. In UTF-32, each code unit whose numeric value isn’t a valid Unicode scalar value is an error. (The last sentence might be taken as admission that UTF-32 code units and scalar values are not the same after all. It is not. It is merely an acknowledgement that C++ does not statically prevent programs that could erroneously put an invalid value into a buffer that is supposed to be UTF-32.)
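
As a minimal sketch of the intended semantics (not of the iterator interface itself; the function name is made up), here is the per-step logic for the UTF-16 case, where each unpaired surrogate yields U+FFFD:

#include <cstddef>
#include <string_view>

// One step of scalar-value iteration over UTF-16: reads one or two code
// units starting at index i, advances i, and returns a Unicode scalar
// value, yielding U+FFFD for each unpaired surrogate.
char32_t next_scalar_utf16(std::u16string_view s, std::size_t& i) {
    char16_t unit = s[i++];
    if (unit < 0xD800 || unit > 0xDFFF) {
        return unit;                      // not a surrogate code unit
    }
    if (unit <= 0xDBFF && i < s.size()) { // lead surrogate; is a trail surrogate next?
        char16_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            return 0x10000 + ((char32_t(unit) - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return U'\uFFFD';                     // unpaired surrogate
}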

In general, new APIs should be defined to handle invalid UTF-8/16/32 either according to the replacement behavior described in the previous paragraph or by stopping and signaling error on the first error. In particular, the replacement behavior should not be left as implementation-defined, considering that differences in the replacement behavior between V8 and Blink led to a bug. (See another write-up on this topic.)

Transcoding Should Be std::span-Based Instead of Iterator-Based

Since the above contemplates a conversion facility between legacy encodings and Unicode encoding forms, it seems on-topic to briefly opine on what such an API should look like. The Text_view paper says:

Transcoding between encodings that use the same character set is currently possible. The following example transcodes a UTF-8 string to UTF-16.


std::string in = get_a_utf8_string();
std::u16string out;
std::back_insert_iterator<std::u16string> out_it{out};
auto tv_in = make_text_view<utf8_encoding>(in);
auto tv_out = make_otext_iterator<utf16_encoding>(out_it);
std::copy(tv_in.begin(), tv_in.end(), tv_out);

Transcoding between encodings that use different character sets is not currently supported due to lack of interfaces to transcode a code point from one character set to the code point of a different one.

Additionally, naively transcoding between encodings using std::copy() works, but is not optimal; techniques are known to accelerate transcoding between some sets of encoding. For example, SIMD instructions can be utilized in some cases to transcode multiple code points in parallel.

Future work is intended to enable optimized transcoding and transcoding between distinct character sets.

I agree with the assessment that iterator and std::copy()-based transcoding is not optimal due to SIMD considerations. To enable the use of SIMD, the input and output should be std::spans, which, unlike iterators, allow the converter to look at more than one element of the std::span at a time. I have designed and implemented such an API for C++, and I invite SG16 to adopt its general API design. I have written a document that covers the API design problems that I sought to address and the design of the API (in Rust but directly applicable to C++). (Please don’t be distracted by the implementation internals being Rust instead of C++. The API design is still valid for C++ even if the design constraint of the implementation internals being behind C linkage is removed. Also, please don’t be distracted by the API predating char8_t.)
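
To give an idea of the shape (the name and exact result type below are illustrative rather than a concrete proposal), a span-based converter could look like the following; because the converter sees whole buffers rather than one element at a time, a SIMD implementation becomes possible:

#include <cstddef>
#include <span>
#include <utility>

// Hypothetical span-based converter: consumes as much of src as fits into
// dst and reports how many elements were read and written, so that a caller
// can loop over successive chunks of streaming input and output.
std::pair<std::size_t, std::size_t>
convert_utf8_to_utf16(std::span<const char8_t> src, std::span<char16_t> dst);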

Implications for Text_view

Above I’ve opined that only UTF-8, UTF-16, and UTF-32 (as Unicode encoding forms—not as Unicode encoding schemes!) should be supported for iteration by scalar value and that legacy encodings should be addressed by a conversion facility. Therefore, I think that Text_view should not be standardized as proposed. Instead, I think std::u8string_view, std::u16string_view, and std::u32string_view should gain a way to obtain a Unicode scalar value iterator (that yields values of type char32_t), and a std::span-based encoding conversion API should be provided as a distinct feature (as opposed to trying to connect Unicode scalar value iterators with std::copy()).