Software Internationalization Guide

Software Characteristics That Vary by Locale
Software Internationalization Guide 526225-002
Case-conversion routines are often written on the assumption that each letter has
exactly one uppercase form and one lowercase form. In French, however, lowercase
letters may lose their diacriticals when converted to uppercase: e, è, é, and ê may
all convert to E. To meet international needs, locales let users define the
uppercase and lowercase mappings so that diacriticals are not lost.
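The difference between the two mappings can be sketched in Python. This is only an illustration: the function names and the keep_diacritics flag are hypothetical stand-ins for a locale-defined mapping, not part of any locale API described in this guide.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_upper(text: str, keep_diacritics: bool = True) -> str:
    """Uppercase text; keep_diacritics is a hypothetical stand-in for the
    locale's choice of uppercase mapping."""
    upper = text.upper()
    return upper if keep_diacritics else strip_diacritics(upper)

word = "\u00e9l\u00e8ve"  # "élève", written with escapes to avoid encoding issues
print(to_upper(word))                         # diacriticals preserved
print(to_upper(word, keep_diacritics=False))  # diacriticals lost: ELEVE
```

With keep_diacritics set to False, e, è, é, and ê all map to E, which is the lossy behavior described above.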
Collation
Collation is the logical ordering of characters based on defined precedence algorithms.
Collation algorithms vary from one language to another and can be based on character
sets, character encoding values, user-defined ordering, or numerous other factors.
Internationalized software must support a large variety of collation algorithms to
accommodate all existing and future written languages.
Character-Encoded Collation
A frequently used collation method is based on the encoded values of a character set.
Table 2-4 illustrates an ASCII-encoded collation scheme.
For characters in the order D, F, C, A, B, E, the result after collating by increasing
ASCII-encoded values is A, B, C, D, E, F. Collating ASCII letters by their encoded
values works well because, within each case, the letters are encoded in alphabetical
order. This is not true of most other character sets, so the same scheme is
inappropriate for them. In ISO 8859-1, for example, the accented uppercase letters
(such as À and É) are encoded after the unaccented lowercase letters, so sorting by
encoded value does not yield alphabetical order.
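As a minimal sketch of encoded-value collation, the following Python fragment sorts the sample letters by their ASCII values and then applies the same method to a few ISO 8859-1 characters. The helper name latin1_value is an assumption made for this illustration.

```python
# Collate the sample letters by increasing encoded value.
letters = ["D", "F", "C", "A", "B", "E"]
by_ascii = sorted(letters, key=ord)
print(by_ascii)  # ['A', 'B', 'C', 'D', 'E', 'F']

def latin1_value(ch: str) -> int:
    """Return the ISO 8859-1 encoded value of a single character."""
    return ch.encode("latin-1")[0]

# The same method applied to ISO 8859-1: "\u00c9" is É (encoded value 201),
# which lands after lowercase a (97) instead of next to E (69).
mixed = ["\u00c9", "Z", "a", "E"]
by_latin1 = sorted(mixed, key=latin1_value)
print([(ch, latin1_value(ch)) for ch in by_latin1])
```

The second result places É last, after both Z and a, which is not where a French reader would expect to find it.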
Collation based on character encoding has further problems. If a code set is used
for more than one language, as in the case of ISO 10646, collating by encoded values
is difficult because the same character can occupy different positions depending on
the language. For example, the character ä sorts with a, near the beginning of the
German alphabet, but it appears near the end of the Swedish alphabet, after z.
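The language dependence can be sketched with two deliberately simplified sort keys. The names SWEDISH_ALPHABET, GERMAN_FOLD, swedish_key, and german_key are hypothetical, and the rules are reduced to the minimum needed to show the position of ä; real collation tables are far more detailed.

```python
# Simplified Swedish alphabet: å, ä, and ö follow z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

# Simplified German dictionary ordering: umlauts sort as their base letters.
GERMAN_FOLD = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}

def swedish_key(word: str) -> list:
    """Rank each letter by its position in the Swedish alphabet.
    Raises ValueError for characters outside that alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

def german_key(word: str) -> str:
    """Fold umlauts to base letters before comparing."""
    return "".join(GERMAN_FOLD.get(ch, ch) for ch in word.lower())

words = ["zon", "ära", "arm"]
print(sorted(words, key=german_key))   # ära sorts first, alongside a
print(sorted(words, key=swedish_key))  # ära sorts last, after z
```

The same three strings produce two different orders, so no single ordering by encoded value can serve both languages.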
Table 2-4. ASCII Code Set

ASCII Encoded Values    Before Collation    After Collation
A=65                    D                   A
B=66                    F                   B
C=67                    C                   C
D=68                    A                   D
E=69                    B                   E
F=70                    E                   F

Some languages require multiple collation passes. In ASCII, all uppercase characters
are encoded before all lowercase characters, but there might