Standard C++ Library Reference ISO/IEC (VERSION3)

Multibyte Characters
A source character set or target character set can also contain multibyte characters (sequences
of one or more bytes). Each sequence represents a single character in the extended character
set. You use multibyte characters to represent large sets of characters, such as Kanji. A
multibyte character can be a one-byte sequence that is a character from the basic C character
set, an additional one-byte sequence that is implementation defined, or an additional sequence
of two or more bytes that is implementation defined.
Any multibyte encoding that contains sequences of two or more bytes depends, for its
interpretation between bytes, on a conversion state determined by bytes earlier in the sequence
of characters. In the initial conversion state, if the byte immediately following matches one of
the characters in the basic C character set, the byte must represent that character.
For example, the EUC encoding is a superset of ASCII. A byte value in the interval [0xA1,
0xFE] is the first of a two-byte sequence (whose second byte value is in the interval [0x80,
0xFF]). All other byte values are one-byte sequences. Since all members of the basic C
character set have byte values in the range [0x00, 0x7F] in ASCII, EUC meets the requirements
for a multibyte encoding in Standard C. An EUC byte sequence is not in the initial conversion
state immediately after a byte value in the interval [0xA1, 0xFE]. It is ill-formed if the
second byte value is not in the interval [0x80, 0xFF].
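The EUC rules above can be sketched as a small scanner. This is only an illustration of the simplified rules as stated in this section, not a complete EUC decoder. The names EucScan, scan_euc, and euc_usage_example are hypothetical and are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of scanning a byte string (hypothetical type for this sketch).
struct EucScan {
    bool well_formed;        // false if ill-formed or truncated
    std::size_t characters;  // multibyte characters seen before stopping
};

// Counts multibyte characters using the simplified EUC rules described above.
EucScan scan_euc(const std::string& bytes) {
    std::size_t count = 0;
    bool expect_second = false;  // true when NOT in the initial conversion state

    for (char ch : bytes) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (expect_second) {
            // The second byte must lie in [0x80, 0xFF].
            if (b < 0x80) return {false, count};
            expect_second = false;
            ++count;
        } else if (b >= 0xA1 && b <= 0xFE) {
            expect_second = true;  // first byte of a two-byte sequence
        } else {
            ++count;  // all other byte values are one-byte sequences
        }
    }
    // Ending partway through a two-byte sequence leaves the scan
    // outside the initial conversion state.
    return {!expect_second, count};
}

// Usage example with made-up input bytes.
void euc_usage_example() {
    assert(scan_euc("a\xA4\xA2" "b").characters == 3);  // 'a', one two-byte character, 'b'
    assert(!scan_euc("\xA4" "A").well_formed);          // second byte below 0x80
}
```

The single flag expect_second is the entire conversion state here, because this encoding has no shift states.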
Multibyte characters can also have a state-dependent encoding. How you interpret a byte in
such an encoding depends on a conversion state that involves both a parse state, as before, and
a shift state, determined by bytes earlier in the sequence of characters. The initial shift state, at
the beginning of a new multibyte character, is also the initial conversion state. A subsequent
shift sequence can determine an alternate shift state, after which all byte sequences (including
one-byte sequences) can have a different interpretation. A byte containing the value zero,
however, always represents the null character. It cannot occur as any of the bytes of another
multibyte character.
For example, the JIS encoding is another superset of ASCII. In the initial shift state, each byte
represents a single character, except for two three-byte shift sequences:
  - The three-byte sequence "\x1B$B" shifts to two-byte mode. Subsequently, two
    successive bytes (both with values in the range [0x21, 0x7E]) constitute a single
    multibyte character.
  - The three-byte sequence "\x1B(B" shifts back to the initial shift state.
JIS also meets the requirements for a multibyte encoding in Standard C. A JIS byte sequence is
not in the initial conversion state when partway through a three-byte shift sequence or when in
two-byte mode.
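The shift-state behavior described for JIS can be sketched as follows. This is a minimal illustration that handles only the two shift sequences named above. The names ShiftState, JisScan, scan_jis, and jis_usage_example are hypothetical. Treating an out-of-range byte in two-byte mode as ill-formed is an assumption of this sketch, since the text above does not say how such bytes are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class ShiftState { Initial, TwoByte };

// Result of scanning a byte string (hypothetical type for this sketch).
struct JisScan {
    bool well_formed;
    std::size_t characters;  // shift sequences are not counted as characters
    ShiftState final_state;
};

JisScan scan_jis(const std::string& bytes) {
    ShiftState state = ShiftState::Initial;
    std::size_t count = 0;
    std::size_t i = 0;

    while (i < bytes.size()) {
        // Shift sequences change the state and produce no character.
        if (bytes.compare(i, 3, "\x1B$B") == 0) {
            state = ShiftState::TwoByte;
            i += 3;
            continue;
        }
        if (bytes.compare(i, 3, "\x1B(B") == 0) {
            state = ShiftState::Initial;
            i += 3;
            continue;
        }
        const unsigned char b1 = static_cast<unsigned char>(bytes[i]);
        // A zero byte is the null character in every shift state;
        // in the initial shift state every byte is one character.
        if (b1 == 0 || state == ShiftState::Initial) {
            ++count;
            ++i;
            continue;
        }
        // Two-byte mode: both bytes must lie in [0x21, 0x7E].
        if (i + 1 >= bytes.size()) return {false, count, state};  // truncated
        const unsigned char b2 = static_cast<unsigned char>(bytes[i + 1]);
        const bool in_range = b1 >= 0x21 && b1 <= 0x7E && b2 >= 0x21 && b2 <= 0x7E;
        if (!in_range) return {false, count, state};  // assumption: ill-formed
        ++count;
        i += 2;
    }
    return {true, count, state};
}

// Usage example with made-up input bytes.
void jis_usage_example() {
    const JisScan r = scan_jis("A\x1B$B0!\x1B(BZ");  // 'A', one two-byte character, 'Z'
    assert(r.well_formed && r.characters == 3);
    assert(r.final_state == ShiftState::Initial);
}
```

A caller that needs the string to end in the initial shift state can check final_state in addition to well_formed.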
(Amendment 1 adds the type mbstate_t, which describes an object that can store a
conversion state. It also relaxes the above rules for generalized multibyte characters, which
describe the encoding rules for a broad range of wide streams.)
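As an illustration of how an mbstate_t object carries the conversion state between calls, the following sketch converts a multibyte string with std::mbrtowc, one of the restartable functions declared in <cwchar>. The names WidenResult and widen_multibyte are hypothetical. The result depends on the current locale's multibyte encoding, and the sketch assumes that a converted null character occupies one byte.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Result of a conversion (hypothetical type for this sketch).
struct WidenResult {
    bool ok;            // false on an invalid or incomplete sequence
    std::wstring text;  // wide characters converted before stopping
};

WidenResult widen_multibyte(const std::string& bytes) {
    std::mbstate_t state = std::mbstate_t();  // zero value: initial conversion state
    std::wstring out;
    const char* p = bytes.data();
    std::size_t remaining = bytes.size();

    while (remaining > 0) {
        wchar_t wc;
        const std::size_t n = std::mbrtowc(&wc, p, remaining, &state);
        // (size_t)-1: invalid sequence; (size_t)-2: incomplete sequence.
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            return {false, out};
        }
        out.push_back(wc);
        // A return of 0 means a null character was converted;
        // assumption: it occupied one byte.
        const std::size_t used = (n == 0) ? 1 : n;
        p += used;
        remaining -= used;
    }
    // A complete string should leave the state in its initial value.
    return {std::mbsinit(&state) != 0, out};
}
```

In the default "C" locale, characters from the basic C character set are expected to convert one byte at a time. Other encodings apply only after a call to std::setlocale selects a locale that uses them.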