Software Internationalization Guide

Software Characteristics That Vary by Locale
Software Internationalization Guide 526225-002
be instances in which an uppercase character is followed by its lowercase counterpart
instead of the next uppercase character. For example, instead of the traditional
A, B, C, …, a, b, c order, the appropriate collation scheme might be A, a, B, b, C, c, …, Z, z.
Character-Set Collation
Character-set collation schemes are based on the actual characters rather than on
their encoded values, which resolves some of the problems of character-encoded
collation. With character-set collation, all existing and new character sets can be
collated appropriately, independent of their encoded values.
With this approach, a number of different collation orders can be defined for a single
character set. Character sets that are case insensitive can have collation orders in
which the uppercase and lowercase versions of a single character have the same sort
value. Punctuation, symbols, and word hyphenators can be defined with the rest of the
character set.
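
As an illustration of defining more than one collation order for a single character set, the following minimal Python sketch builds two lookup tables for the basic Latin letters. The table names (INTERLEAVED, CASE_INSENSITIVE) and the sort_key helper are hypothetical and are not part of any product described in this guide.

```python
import string

# Hypothetical collation order 1: case-interleaved (A, a, B, b, ..., Z, z).
INTERLEAVED = {
    ch: rank
    for rank, ch in enumerate(
        "".join(u + l for u, l in zip(string.ascii_uppercase,
                                      string.ascii_lowercase))
    )
}

# Hypothetical collation order 2: case-insensitive. The uppercase and
# lowercase versions of a letter share the same sort value.
CASE_INSENSITIVE = {ch: rank for rank, ch in enumerate(string.ascii_uppercase)}
CASE_INSENSITIVE.update(
    {ch: rank for rank, ch in enumerate(string.ascii_lowercase)}
)


def sort_key(word, table):
    """Map each character to its sort value in the chosen collation table."""
    return [table[ch] for ch in word]


letters = ["b", "B", "a", "C", "A", "c"]

# Sorting by encoded value puts all uppercase letters first.
encoded_order = sorted(letters)

# Sorting by the collation table ignores the encoded values.
interleaved_order = sorted(letters, key=lambda w: sort_key(w, INTERLEAVED))
```

Here encoded_order is A, B, C, a, b, c, while interleaved_order is A, a, B, b, C, c. Under CASE_INSENSITIVE, the keys for "Apple" and "apple" are identical, so the two words collate the same.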
Multilevel Collation
With multilevel collation, several collation passes are made, each one refining the
result of the previous pass. Collation that involves case-sensitive characters and
diacriticals often requires multiple passes. In Spanish, for example, characters with
the same base character (with or without diacriticals) are weighted equally in the
first pass.
Table 2-5 shows the results of a multilevel collation based on the Spanish
character-set collation scheme. In the first collation pass, characters are grouped
according to the base character, without consideration for the diacritical. The
characters a and á are therefore weighted equally in the first pass, and the words mas
and más collate the same. Because the two words collate the same, a second collation
pass is made on them. The second pass recognizes the diacritical above the base
character a, so the character a sorts first, followed by the character á.
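
The two passes described above can be sketched in Python as a single sort on a two-part key: the first part holds the base characters, and the second part records which positions carry a diacritical. This is a simplified illustration, not a complete Spanish collation. The function names are hypothetical, and the sketch treats every combining mark as a diacritical, which would not be appropriate for a letter such as ñ that is conventionally collated as a letter of its own.

```python
import unicodedata


def base_letters(word):
    """First-pass key: the word with combining marks removed ('más' -> 'mas')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_marks(word):
    """Second-pass key: 0 for an unaccented character, 1 for an accented one."""
    marks = []
    for ch in unicodedata.normalize("NFC", word):
        # An accented character decomposes into a base plus combining marks.
        marks.append(1 if len(unicodedata.normalize("NFD", ch)) > 1 else 0)
    return tuple(marks)


def multilevel_key(word):
    """Compare base letters first; use diacriticals only to break ties."""
    return (base_letters(word), diacritic_marks(word))


words = ["masacrar", "mas", "máscara", "más", "masa"]
collated = sorted(words, key=multilevel_key)
```

With the five words from Table 2-5 as input, collated is mas, más, masa, masacrar, máscara, which matches the result column of the table.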
Ideographic Character Collation
Ideographic writing systems are composed of several thousand characters. Collation
methods in ideographic writing systems are more complex than those used for
phonetic systems, and can be based on various factors. Generally, a collation scheme
based on a combination of stroke count, radical base, and phonetics is used.
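
As a rough illustration of one such scheme, the following Python sketch sorts a handful of simple characters by stroke count and then by radical number. The lookup table is a hypothetical, hand-entered sample covering only seven characters; a real system would draw this data from a dictionary source and could add a phonetic level, which is omitted here.

```python
# Hypothetical sample table: character -> (stroke count, Kangxi radical number).
STROKES_AND_RADICAL = {
    "一": (1, 1),
    "二": (2, 7),
    "人": (2, 9),
    "大": (3, 37),
    "山": (3, 46),
    "日": (4, 72),
    "木": (4, 75),
}


def ideograph_key(ch):
    """Sort by stroke count first, then by radical number to break ties."""
    return STROKES_AND_RADICAL[ch]


chars = ["木", "山", "一", "日", "人", "大", "二"]
by_strokes = sorted(chars, key=ideograph_key)
```

For this sample, by_strokes is 一, 二, 人, 大, 山, 日, 木. Characters with equal stroke counts, such as 二 and 人, are ordered by radical number.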
Table 2-5. Multilevel Collation

Words to Collate    Result of Collation
masacrar            mas
mas                 más
máscara             masa
más                 masacrar
masa                máscara