HP-UX 11i v3 Internationalization Features

4. Code Set Conversions
The iconv converters have been updated for Unicode 5.0. They support the new Unicode 5.0
characters, surrogate characters, byte order marks, and all Unicode-specified transformation
formats (UTF-8, UTF-16, and UTF-32, in both big-endian and little-endian forms). For the
complete list of iconv converters supported as part of the base operating system, refer to the
"system.config.iconv" file under /usr/lib/nls/iconv.
Conversions Between Unicode Variants
For HP-UX 11i v3, a complete set of bidirectional converters between all Unicode variants is
provided. Unicode variants (their aliases are listed in the table below) are different ways of
representing (encoding) Unicode code points. For example, UTF-8 is a byte-oriented encoding that
uses 1 to 4 bytes per code point; UTF-16 uses one or two 16-bit units, and UTF-32 uses a single
32-bit unit, to represent the Unicode 5.0 code point range.
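As a rough illustration of how the variants differ in size, the following Python sketch encodes
three sample code points and reports the number of bytes each form needs. It uses Python's
built-in codecs, not the HP-UX iconv converters, and the sample code points and the helper name
encoded_lengths are chosen here for illustration only.

```python
# Illustration only: uses Python's built-in codecs, not HP-UX iconv.
SAMPLES = {
    "U+0041": "\u0041",        # LATIN CAPITAL LETTER A (inside the BMP)
    "U+20AC": "\u20ac",        # EURO SIGN (inside the BMP)
    "U+1D11E": "\U0001d11e",   # MUSICAL SYMBOL G CLEF (beyond the BMP)
}

# Explicit big-endian forms are used so that no byte order mark is emitted.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(label, encoded_lengths(ch))
```

The code point beyond the BMP takes 4 bytes in every form shown; in UTF-16 this is because it is
stored as a pair of 16-bit surrogate values.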
HP-UX 11i v3 also supports the following Unicode properties:
- Byte order: data encoded in UTF-16/32 can be big-endian (BE), where the bytes that constitute
  a UTF-16/32 value in memory or in a file are arranged with the most significant byte (big end)
  first, or little-endian (LE), where the least significant byte occurs first.
- Byte order mark (BOM): a byte order mark occurring at the beginning of a UTF-16/32 data stream
  indicates the endianness of the data. There are big-endian and little-endian BOMs. In the
  absence of a BOM, UTF-16/32 data streams are considered big-endian.
- Surrogate area: a single 16-bit word cannot represent code points beyond the first 64K (the
  Basic Multilingual Plane, or BMP) of the Unicode 5.0 code space. UTF-16 sets aside two
  contiguous 1K code point regions in the BMP (the high and low surrogate areas). A code point
  beyond the BMP is encoded by concatenating the lower 10 bits of one code point from each of
  these regions and implicitly adding a 64K offset.
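The byte order mark and surrogate rules described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the HP-UX converter implementation:
the function names are hypothetical, and only the BOM constants come from Python's standard
codecs module.

```python
import codecs

# BOM byte sequences for each code unit width (2 = UTF-16, 4 = UTF-32).
_BOMS = {
    2: {"BE": codecs.BOM_UTF16_BE, "LE": codecs.BOM_UTF16_LE},
    4: {"BE": codecs.BOM_UTF32_BE, "LE": codecs.BOM_UTF32_LE},
}


def detect_byte_order(data, width):
    """Return 'BE' or 'LE' for UTF-16 (width=2) or UTF-32 (width=4) data.

    Without a BOM the data is treated as big-endian, as stated in the text.
    """
    for order, bom in _BOMS[width].items():
        if data.startswith(bom):
            return order
    return "BE"


def to_surrogates(code_point):
    """Split a code point beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    value = code_point - 0x10000           # remove the 64K offset
    high = 0xD800 + (value >> 10)          # upper 10 bits
    low = 0xDC00 + (value & 0x3FF)         # lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)
    print(hex(high), hex(low))
    print(detect_byte_order(b"\xff\xfeA\x00", 2))
```

The caller passes the code unit width explicitly because the little-endian UTF-32 BOM begins with
the same two bytes as the little-endian UTF-16 BOM, so the width cannot be inferred reliably from
the BOM alone.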
Unicode variant aliases used by the HP-UX iconv command
base name    UCS alias    UTF alias
ucs2         UCS-2        UTF-16
ucs2be       UCS-2BE      UTF-16BE
ucs2le       UCS-2LE      UTF-16LE
ucs4         UCS-4        UTF-32
ucs4be       UCS-4BE      UTF-32BE
ucs4le       UCS-4LE      UTF-32LE
utf8         UTF8, UTF-8
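For scripts that receive encoding names from several sources, the alias table above could be
captured as a small lookup. The Python sketch below is hypothetical: the names ALIASES, base_name,
and iconv_args are invented for illustration, the case-insensitive matching is an assumption, and
the code only builds an argument list for the iconv command's -f (from) and -t (to) options
without running it.

```python
# Alias table transcribed from the table above; helper names are hypothetical.
ALIASES = {
    "ucs2": ("UCS-2", "UTF-16"),
    "ucs2be": ("UCS-2BE", "UTF-16BE"),
    "ucs2le": ("UCS-2LE", "UTF-16LE"),
    "ucs4": ("UCS-4", "UTF-32"),
    "ucs4be": ("UCS-4BE", "UTF-32BE"),
    "ucs4le": ("UCS-4LE", "UTF-32LE"),
    "utf8": ("UTF8", "UTF-8"),
}

# Reverse index: every base name and alias (lower-cased) maps to its base name.
_LOOKUP = {}
for _base, _names in ALIASES.items():
    _LOOKUP[_base] = _base
    for _name in _names:
        _LOOKUP[_name.lower()] = _base


def base_name(name):
    """Map a base name or alias to its base name (assumed case-insensitive)."""
    key = name.strip().lower()
    if key not in _LOOKUP:
        raise ValueError("unknown Unicode variant: %r" % name)
    return _LOOKUP[key]


def iconv_args(from_code, to_code):
    """Build an argument list for the iconv command; nothing is executed."""
    return ["iconv", "-f", base_name(from_code), "-t", base_name(to_code)]


if __name__ == "__main__":
    print(iconv_args("UTF-16LE", "UTF-8"))
```

Normalizing to the base name is a convenience for the calling script only; per the table, the
iconv command itself recognizes the aliases.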
Matrix of supported conversions (checked boxes) between various Unicode variants
from\to ucs2 ucs2be ucs2le ucs4 ucs4be ucs4le utf8
ucs2 -
ucs2be -