HP-UX 11i v3 Internationalization Features

4. Code Set Conversions

The iconv converters have been extended for Unicode 5.0. They now support the new Unicode 5.0
characters, surrogate characters, byte-order marks, and all Unicode-specified transformation
formats (UTF-8, UTF-16, and UTF-32, including the big-endian and little-endian forms). Refer to the
"system.config.iconv" file under /usr/lib/nls/iconv for the complete list of iconv converters
supported as part of the base operating system.

Conversions Between Unicode Variants

For HP-UX 11i v3, a complete set of bidirectional converters between all Unicode variants is
provided. Unicode variants (their aliases are listed in the table below) are different ways of
representing (encoding) Unicode code points. For example, UTF-8 is a byte-oriented encoding that
uses 1 to 4 bytes per code point; UTF-16 uses one or two 2-byte code units, and UTF-32 uses a
single 4-byte code unit, to represent the Unicode 5.0 code point range.
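
As an illustration of how the encoding forms differ in size, the following minimal sketch uses Python's built-in codecs (not the HP-UX iconv converters) to count the bytes each form needs for a few sample code points. The helper name encoded_lengths and the sample characters are chosen here for illustration only.

```python
# Compare how many bytes each Unicode encoding form needs per code point.
# This uses Python's own codecs; it does not call the HP-UX iconv converters.
SAMPLES = {
    "U+0041 LATIN CAPITAL LETTER A": "\u0041",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+20AC EURO SIGN": "\u20ac",
    "U+1D11E MUSICAL SYMBOL G CLEF": "\U0001d11e",  # beyond the BMP
}


def encoded_lengths(ch):
    """Return byte counts for one code point in (UTF-8, UTF-16BE, UTF-32BE)."""
    return tuple(
        len(ch.encode(codec)) for codec in ("utf-8", "utf-16-be", "utf-32-be")
    )


for name, ch in SAMPLES.items():
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"{name}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

The explicit big-endian codecs are used so that no byte order mark is added to the output and the counts reflect the code point alone.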

HP-UX 11i v3 also supports the following Unicode properties:

• Byte order: Data encoded in UTF-16/32 can use different byte orders. Byte order can be
big-endian (BE), where the bytes that constitute a UTF-16/32 value in memory or in a file are
arranged with the most significant byte (big end) first, or little-endian (LE), where the least
significant byte occurs first.

• Byte order mark (BOM): A byte order mark occurring at the beginning of a UTF-16/32 data stream
indicates the endianness of the data. There are big-endian and little-endian BOMs. In the
absence of a BOM, UTF-16/32 data streams are considered big-endian.

• Surrogate area: A single 16-bit word cannot represent code points beyond the first 64K (the
Basic Multilingual Plane, or BMP) of the Unicode 5.0 code space. UTF-16 sets aside two
contiguous 1K (1,024) code point regions in the BMP: the high surrogate area (U+D800 to U+DBFF)
and the low surrogate area (U+DC00 to U+DFFF). A code point beyond the BMP is encoded as one
code point from each area; concatenating the lower 10 bits of the two and adding the implicit
64K (0x10000) offset yields the encoded code point.
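
The byte order, BOM, and surrogate rules listed above can be sketched in a few lines. This is an illustrative sketch in Python, not the implementation used by the HP-UX converters; the function names (detect_utf16_order, decode_utf16, to_surrogates, from_surrogates) are hypothetical.

```python
import codecs


def detect_utf16_order(data):
    """Return (byte_order, payload) for a UTF-16 byte stream.

    A leading BOM selects the byte order and is stripped. Without a BOM,
    the stream is treated as big-endian.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little", data[2:]
    return "big", data


def decode_utf16(data):
    """Decode UTF-16 bytes to text, honoring an optional BOM."""
    order, payload = detect_utf16_order(data)
    return payload.decode("utf-16-be" if order == "big" else "utf-16-le")


def to_surrogates(code_point):
    """Split a code point beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    offset = code_point - 0x10000      # remove the 64K offset, leaving 20 bits
    high = 0xD800 | (offset >> 10)     # upper 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)    # lower 10 bits go in the low surrogate
    return high, low


def from_surrogates(high, low):
    """Concatenate the lower 10 bits of each surrogate and add the 64K offset."""
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000


high, low = to_surrogates(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
print(decode_utf16(b"\xff\xfeA\x00"), decode_utf16(b"\x00A"))
```

The same letter "A" decodes correctly from a little-endian stream with a BOM and from a big-endian stream without one, because the missing BOM falls back to big-endian.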

Unicode variant aliases used by the HP-UX iconv command

base name    UCS alias    UTF alias
ucs2         UCS-2        UTF-16
ucs2be       UCS-2BE      UTF-16BE
ucs2le       UCS-2LE      UTF-16LE
ucs4         UCS-4        UTF-32
ucs4be       UCS-4BE      UTF-32BE
ucs4le       UCS-4LE      UTF-32LE
utf8                      UTF8, UTF-8
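
To show how the aliases relate to the base names, the sketch below builds a lookup from the alias table and resolves any listed name to its base name. The helper base_name is hypothetical and is not part of HP-UX; names are matched exactly as spelled in the table, because this document does not say whether the iconv command treats code set names case-insensitively.

```python
# Alias table from this section: base name -> its listed aliases.
ALIASES = {
    "ucs2": ("UCS-2", "UTF-16"),
    "ucs2be": ("UCS-2BE", "UTF-16BE"),
    "ucs2le": ("UCS-2LE", "UTF-16LE"),
    "ucs4": ("UCS-4", "UTF-32"),
    "ucs4be": ("UCS-4BE", "UTF-32BE"),
    "ucs4le": ("UCS-4LE", "UTF-32LE"),
    "utf8": ("UTF8", "UTF-8"),
}

# Reverse lookup: every base name and alias maps to its base name.
LOOKUP = {}
for base, aliases in ALIASES.items():
    LOOKUP[base] = base
    for alias in aliases:
        LOOKUP[alias] = base


def base_name(name):
    """Return the base name for a listed name (hypothetical helper).

    Matching is exact; an unlisted spelling raises KeyError.
    """
    return LOOKUP[name]


print(base_name("UTF-16BE"), base_name("UCS-4"), base_name("UTF-8"))
```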

Matrix of supported conversions (checked boxes) between various Unicode variants

from\to ucs2 ucs2be ucs2le ucs4 ucs4be ucs4le utf8

ucs2 - • • • • •

ucs2be - • • • • •