C/C++ Programmer's Guide (G06.25+)

HP C Implementation-Defined Behavior
HP C/C++ Programmer’s Guide for NonStop Systems, 429301-008
G.5 Common Extensions
Multibyte Characters and Wide Characters
Multibyte characters and wide characters support Asian alphabets that often contain a
very large number of characters. The Guardian TNS C run-time library functions,
except for the strcoll() and strxfrm() functions, support these character sets:
Tandem Kanji, Chinese Big 5, Chinese PC, Hangul and KSC5601.
The following discussion of multibyte characters applies only to the Guardian
environment. For details on multibyte characters in the Open System Services (OSS)
environment, refer to the Software Internationalization Manual.
The D30 and later Guardian C run-time library functions mblen(), mbtowc(),
mbstowcs(), wctomb(), and wcstombs() do not support multibyte characters for
programs that use the 32-bit (or wide) data model as described in this section.
Guardian programs that use the 32-bit data model must use the Guardian system
procedures that support multibyte characters instead. For details, refer to the Guardian
Programmer’s Guide.
The default character set supported by a system is configured at system installation
time and cannot be changed during program execution. The Guardian procedure
MBCS_DEFAULTCHARSET_ returns the identifier of the default character set. The
Guardian Procedure Calls Reference Manual describes this system procedure in
detail.
The internal representation of the characters of these languages is HP internal and
might not conform to any ISO standard. HP can choose to change this internal
representation at any time.
Multibyte Characters:
The basic difficulty in an Asian environment is the huge number of ideograms, such as
Chinese characters, that are needed for I/O. To work within the constraints
of usual computer architectures, these ideograms are encoded as sequences of
bytes. The associated operating systems, application programs, and terminals
understand these byte sequences as individual ideograms. Moreover, all of these
encodings allow intermixing of regular single-byte C characters with the ideogram
byte sequences.
The term “multibyte character” denotes a byte sequence that encodes an
ideogram. The byte sequence contains one or more codes where each code can
be represented in a C character data type: char, signed char, or unsigned char. All
multibyte characters are members of the so-called extended character set. A
regular single-byte C character is just a special case of a multibyte sequence
where the sequence has a length of one.
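The following sketch illustrates how a program might step through a string that mixes single-byte characters with multibyte sequences. It does not use any Guardian procedure or C run-time library function. The encoding rule (a byte with the high bit set begins a two-byte character) and the names demo_char_len() and demo_count_chars() are assumptions made only for this illustration; the supported character sets define their own byte ranges.

```c
#include <stddef.h>

/* Hypothetical encoding, for illustration only: a byte with the high bit
   set is the first byte of a two-byte character; any other byte is a
   single-byte character. Real character sets use their own ranges. */
static size_t demo_char_len(const unsigned char *p)
{
    if (p[0] == '\0')
        return 0;                   /* end of string */
    if ((p[0] & 0x80) != 0 && p[1] != '\0')
        return 2;                   /* first byte plus second byte */
    return 1;                       /* single byte, or a truncated pair */
}

/* Count characters (not bytes) in a string that mixes both kinds. */
size_t demo_count_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    size_t len;

    while ((len = demo_char_len(p)) != 0) {
        p += len;                   /* advance by one whole character */
        count++;
    }
    return count;
}
```

Under this assumed encoding, a four-byte string made of one single-byte character, one two-byte sequence, and another single-byte character is counted as three characters.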
Wide Characters:
Some of the inconvenience of handling multibyte characters is eliminated if all
characters are of a uniform number of bytes or bits. A 16-bit integer value is used
to represent every member of the extended character set because an Asian character set
can contain thousands or tens of thousands of ideograms.
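As a rough sketch of the idea, the function below expands a mixed string into an array of uniform 16-bit values, one per character. It uses the same hypothetical rule as an assumption (a byte with the high bit set begins a two-byte character) and does not reflect the HP internal representation, which is not documented here. The name demo_to_wide() and the use of uint16_t from <stdint.h> are also assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a mixed single-byte/two-byte string into 16-bit values.
   Hypothetical encoding, for illustration only: a byte with the high bit
   set begins a two-byte character. Stores at most max values in out and
   returns the number of values stored. */
size_t demo_to_wide(const char *s, uint16_t *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0' && n < max) {
        if ((p[0] & 0x80) != 0 && p[1] != '\0') {
            /* Pack both bytes of the sequence into one 16-bit value. */
            out[n++] = (uint16_t)(((unsigned)p[0] << 8) | p[1]);
            p += 2;
        } else {
            /* A single-byte character occupies the low 8 bits. */
            out[n++] = p[0];
            p += 1;
        }
    }
    return n;
}
```

After conversion, every character occupies exactly one array element, so indexing and length calculations no longer need to inspect byte values.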