Software Internationalization Guide
Software Characteristics That Vary by Locale
Software Internationalization Guide—526225-002
Character Classification
Character classification is the grouping of characters into named classes whose
members share the attribute that the class name describes. For example, the ASCII
character classes include uppercase, lowercase, alphabetic, digit, and punctuation.
Knowing the class of a character makes it easier to decide how a program should
process that character.
Internationalization requires additional character classifications to accommodate rules
and symbols beyond those included in the ASCII character set.
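The grouping described above can be sketched with the standard C classification functions. This is only an illustration: the char_class type and the classify_char() helper are hypothetical names that do not come from this guide, and the result depends on the current locale (the default "C" locale is assumed here).

```c
#include <ctype.h>

/* Hypothetical class labels used only in this sketch. */
typedef enum {
    CLASS_UPPERCASE,
    CLASS_LOWERCASE,
    CLASS_DIGIT,
    CLASS_PUNCTUATION,
    CLASS_OTHER
} char_class;

/* Report the class of one single-byte character.
 * The alphabetic class is the union of the uppercase and lowercase
 * classes here, so it is not reported separately. */
char_class classify_char(int c)
{
    /* The <ctype.h> functions require a value representable as
     * unsigned char (or EOF), so convert before calling them. */
    unsigned char uc = (unsigned char)c;

    if (isupper(uc)) return CLASS_UPPERCASE;
    if (islower(uc)) return CLASS_LOWERCASE;
    if (isdigit(uc)) return CLASS_DIGIT;
    if (ispunct(uc)) return CLASS_PUNCTUATION;
    return CLASS_OTHER;
}
```

Because the helper asks the library for the class instead of comparing against code values, the same source can follow whatever classification the active locale defines.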
Case Classification
Classifying characters as uppercase or lowercase does not work for all international
character sets. Many writing systems, such as those used for Arabic, Hindi, and
Japanese, do not distinguish case at all. Languages that do distinguish case have
exceptions. In German, for example, the lowercase character ß traditionally has no
single-character uppercase equivalent. Instead, the character ß is converted to the
two uppercase characters SS.
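The ß exception means that a character-by-character conversion cannot uppercase German text correctly, because the output can be longer than the input. The following is a minimal sketch of one way to handle it. The upper_german() helper is a hypothetical name, and the code assumes that wchar_t values are Unicode code points (true on many platforms, but not guaranteed by the C standard).

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase src into dst, expanding sharp s to "SS".
 * dst_len is the capacity of dst in wide characters, including the
 * terminating null. Returns the number of wide characters written
 * (excluding the terminator), or (size_t)-1 if dst is too small. */
size_t upper_german(const wchar_t *src, wchar_t *dst, size_t dst_len)
{
    size_t out = 0;

    for (; *src != L'\0'; ++src) {
        if (*src == 0x00DF) {              /* U+00DF LATIN SMALL LETTER SHARP S */
            if (out + 2 >= dst_len)        /* need two slots plus the terminator */
                return (size_t)-1;
            dst[out++] = L'S';
            dst[out++] = L'S';
        } else {
            if (out + 1 >= dst_len)        /* need one slot plus the terminator */
                return (size_t)-1;
            dst[out++] = (wchar_t)towupper((wint_t)*src);
        }
    }

    if (out >= dst_len)                    /* covers dst_len == 0 */
        return (size_t)-1;
    dst[out] = L'\0';
    return out;
}
```

The caller must size the destination for the worst case, since every ß in the input produces two output characters.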
New Character Classifications
Most programming languages provide no built-in classification for the character sets
of languages that are not Latin-based. Phonetic and ideographic writing systems are
two examples of systems whose characters do not fit the classes defined for ASCII.
Some languages also need additional classes. For languages like Hindi and Thai, for
example, classes must differentiate between vowels and consonants.
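As an illustration of such an additional class, the sketch below separates vowels from consonants in the Devanagari script used for Hindi. It is deliberately simplified and is not an API from this guide or from the C library: the deva_class type and classify_devanagari() function are hypothetical, the input is assumed to be a Unicode code point, and only the core ranges of the Devanagari block are covered.

```c
#include <stdint.h>

/* Hypothetical class labels for this sketch. */
typedef enum {
    DEVA_OTHER,
    DEVA_VOWEL,        /* independent vowel letters  */
    DEVA_VOWEL_SIGN,   /* dependent vowel signs      */
    DEVA_CONSONANT
} deva_class;

/* Classify one Unicode code point using core Devanagari ranges only.
 * Later additions to the block and nukta forms are reported as
 * DEVA_OTHER in this simplified version. */
deva_class classify_devanagari(uint32_t cp)
{
    if (cp >= 0x0905 && cp <= 0x0914) return DEVA_VOWEL;       /* A .. AU   */
    if (cp >= 0x0915 && cp <= 0x0939) return DEVA_CONSONANT;   /* KA .. HA  */
    if (cp >= 0x093E && cp <= 0x094C) return DEVA_VOWEL_SIGN;  /* signs AA .. AU */
    return DEVA_OTHER;
}
```

A production classifier would more likely rely on locale data or Unicode character properties than on hard-coded ranges, which is the same argument this section makes against hard-coding ASCII values.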
Classification Functions
Programs often process characters based on their character classification groups.
Internationalized character classification functions are locale-dependent and help
programmers avoid hard-coding characters that belong to a given class. Some
programming languages provide classification features to support internationalization.
The C programming language, for example, includes the isalpha() function to
determine whether a character is alphabetic. Instead of comparing a character to
hard-coded values in the ASCII code set, a program calls isalpha() to determine
whether the character belongs to the alphabetic class of the current locale. When the
program is later enabled for a new locale, the same isalpha() call tests membership
in the alphabetic class of that locale.
For example, in a program enabled for the US English locale, isalpha() returns
false (zero) for the character í because í is not a member of the US English
alphabet. If the same program is enabled for Spanish, isalpha() returns true (a
nonzero value) because the character í is a valid member of the Spanish alphabet.
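The locale dependence described above can be sketched as follows. The is_alpha_in_locale() helper is a hypothetical name, and the sketch assumes a single-byte code set such as ISO 8859-1, in which í is the byte value 0xED. Locale names are platform-specific, so any name passed in is an assumption about the host system.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: test one single-byte character against the
 * alphabetic class of the named locale.
 * Returns 1 if alphabetic, 0 if not, -1 if the locale is unavailable. */
int is_alpha_in_locale(const char *locale_name, unsigned char byte)
{
    int result;

    /* Only the character classification category needs to change. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    result = (isalpha(byte) != 0);

    /* This sketch simply returns to the "C" locale afterward; it does
     * not restore whatever locale the caller had set before. */
    setlocale(LC_CTYPE, "C");
    return result;
}
```

In the default "C" locale, is_alpha_in_locale("C", 0xED) reports 0. On a system that has a Spanish ISO 8859-1 locale installed (the name might be something like "es_ES.ISO8859-1", but this varies by platform), the same call with that locale name would be expected to report 1.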
The C programming language also has routines that perform class conversions. For
example, tolower() converts uppercase characters to lowercase characters. Most
existing C-type functions, however, provide support only for the ASCII code set.
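A minimal sketch of a conversion built on tolower() follows. The lowercase_in_place() helper is a hypothetical name. Because tolower() handles one single-byte character at a time, this approach does not cover multibyte encodings, which illustrates the limitation noted above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: convert a null-terminated single-byte string to
 * lowercase in place, using the current locale's conversion rules.
 * Returns the number of characters that were changed. */
size_t lowercase_in_place(char *s)
{
    size_t changed = 0;

    for (; *s != '\0'; ++s) {
        /* Convert through unsigned char: passing a negative char value
         * to tolower() is undefined behavior. */
        char lower = (char)tolower((unsigned char)*s);
        if (lower != *s) {
            *s = lower;
            ++changed;
        }
    }
    return changed;
}
```

For text that is not limited to a single-byte code set, the wide-character counterparts declared in <wctype.h>, such as towlower(), are one alternative to consider.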