C/C++ Programmer's Guide (G06.27+, H06.08+, J06.03+)

Multibyte Characters
The basic difficulty in an Asian environment is the huge number of ideograms, such as Chinese
characters, that are needed for I/O. To work within the constraints of usual computer
architectures, these ideograms are encoded as sequences of bytes. The associated operating
systems, application programs, and terminals understand these byte sequences as individual
ideograms. Moreover, all these encodings allow intermixing of regular single-byte C characters
with the ideogram byte sequences.
The term “multibyte character” denotes a byte sequence that encodes an ideogram. The byte
sequence contains one or more codes where each code can be represented in a C character
data type: char, signed char, or unsigned char. All multibyte characters are members of the
so-called extended character set. A regular single-byte C character is just a special case of a
multibyte sequence where the sequence has a length of one.
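As an illustration of how a program might step through such byte sequences, the following minimal sketch counts the characters (not the bytes) in a multibyte string by calling mblen() repeatedly. The helper name count_mb_chars is hypothetical and is not part of the run-time library; the sketch assumes the string is null-terminated and is interpreted in the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the multibyte characters in s.
   Returns (size_t)-1 if s contains an invalid byte sequence. */
size_t count_mb_chars(const char *s)
{
    size_t remaining;
    size_t count = 0;

    assert(s != NULL);
    remaining = strlen(s);          /* bytes left to examine */

    (void)mblen(NULL, 0);           /* reset any shift state */

    while (remaining > 0) {
        int len = mblen(s, remaining);

        if (len < 0)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (len == 0)
            break;                  /* null character reached */

        s += len;                   /* skip the bytes of this character */
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

In the default "C" locale every character occupies one byte, so count_mb_chars("abc") returns 3. In a locale with a multibyte encoding, the result can be smaller than the value returned by strlen().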
Wide Characters
Some of the inconvenience of handling multibyte characters is eliminated if all characters are
of a uniform number of bytes or bits. Because an Asian character set can contain thousands or
tens of thousands of ideograms, a 16-bit integer value is used to represent all members of the
extended character set.
Wide characters are integers of type wchar_t, defined in the headers stddef.h and
stdlib.h as:
typedef unsigned short wchar_t;
Such an integer can represent distinct codes for each of the characters in the extended character
set. The codes for the basic C character set have the same values as their single-character
forms.
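The statement about the basic C character set can be checked with a short sketch such as the one below, which widens each character of a string with mbtowc() and compares the resulting code with the original single-byte value. The helper name codes_preserved is hypothetical, and the sketch assumes the string holds only single-byte characters from the basic character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if every character in chars keeps
   its code when converted to a wide character, 0 otherwise. */
int codes_preserved(const char *chars)
{
    assert(chars != NULL);

    (void)mbtowc(NULL, NULL, 0);    /* reset any shift state */

    for (; *chars != '\0'; chars++) {
        wchar_t wc;

        /* Convert exactly one byte to a wide character. */
        if (mbtowc(&wc, chars, 1) != 1)
            return 0;               /* not a single-byte character */

        if (wc != (wchar_t)(unsigned char)*chars)
            return 0;               /* code changed when widened */
    }
    return 1;
}
```

A call such as codes_preserved("ABCxyz019") is expected to return 1 when the implementation behaves as described above.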
Relationship Between Multibyte and Wide Characters
Multibyte characters are convenient for communicating between the program and the outside
world.
Wide characters are convenient for manipulating text within a program.
The fixed size of wide characters simplifies handling both individual characters and arrays
of characters.
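The division of labor described above can be sketched as follows: text arrives as a multibyte string, is converted to wide characters with mbstowcs(), is manipulated by simple array indexing, and is converted back with wcstombs() for output. The helper name reverse_text and the MAX_CHARS capacity are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CHARS 64    /* illustrative capacity, in characters */

/* Hypothetical helper: writes the characters of the multibyte string
   src to dst in reverse order. dst_size is the size of dst in bytes.
   Returns 0 on success, -1 on failure. */
int reverse_text(const char *src, char *dst, size_t dst_size)
{
    wchar_t wide[MAX_CHARS + 1];
    size_t n, i, written;

    assert(src != NULL && dst != NULL);

    /* Multibyte to wide: one array element per character. */
    n = mbstowcs(wide, src, MAX_CHARS + 1);
    if (n == (size_t)-1 || n > MAX_CHARS)
        return -1;                  /* invalid input or too long */

    /* Fixed-size elements make reversal a plain array swap. */
    for (i = 0; i < n / 2; i++) {
        wchar_t tmp = wide[i];
        wide[i] = wide[n - 1 - i];
        wide[n - 1 - i] = tmp;
    }

    /* Wide back to multibyte for the outside world. */
    written = wcstombs(dst, wide, dst_size);
    if (written == (size_t)-1 || written >= dst_size)
        return -1;                  /* unconvertible, or dst too small */

    return 0;
}
```

Reversing the bytes of a multibyte string directly would split up the byte sequences of individual ideograms. Working on the wide-character array avoids that problem because each element is one complete character.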
MB_CUR_MAX Macro
The MB_CUR_MAX macro specifies the maximum number of bytes used in representing a
multibyte character in the current locale (category LC_CTYPE). The MB_CUR_MAX macro is
defined in the header STDLIBH as:
#define MB_CUR_MAX 2
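A typical use of MB_CUR_MAX is to make sure an output buffer has room for the longest possible multibyte character before calling wctomb(). The following is a minimal sketch under that assumption; the helper name append_wide_char and its parameters are hypothetical. On implementations where MB_CUR_MAX is not a compile-time constant, MB_LEN_MAX from limits.h can be used to size fixed arrays instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: converts wc to its multibyte form and stores
   it in buf at offset *used, then advances *used.
   Returns the number of bytes written, or -1 on failure. */
int append_wide_char(char *buf, size_t buf_size, size_t *used, wchar_t wc)
{
    int len;

    assert(buf != NULL && used != NULL && *used <= buf_size);

    /* Require room for the worst case in the current locale. */
    if (buf_size - *used < (size_t)MB_CUR_MAX)
        return -1;

    len = wctomb(buf + *used, wc);
    if (len < 0)
        return -1;                  /* wc has no multibyte form */

    *used += (size_t)len;
    return len;
}
```

Checking against MB_CUR_MAX rather than the actual length is slightly conservative, but it lets the check happen before wctomb() writes anything to the buffer.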
Conversion Functions
The run-time library functions that manage multibyte characters and wide characters are:
Function      Description
mblen()       Determines the length of a multibyte character.
mbtowc()      Converts a multibyte character to a wide character.
wctomb()      Converts a wide character to a multibyte character.
mbstowcs()    Converts a string of multibyte characters to a string of wide characters.
wcstombs()    Converts a string of wide characters to a string of multibyte characters.