Software Internationalization Guide

Software Characteristics That Vary by Locale

Software Internationalization Guide—526225-002

Character Classification

Character classification is the grouping of characters into named classes that share an
attribute associated with the name of the class. For example, the ASCII character classes
include uppercase, lowercase, alphabetic, digit, and punctuation. It is easier to determine
how to process a character if its classification is defined.

Internationalization requires additional character classifications to accommodate rules
and symbols beyond those included in the ASCII character set.

Case Classification

Classifying characters as uppercase or lowercase does not suit all international
character sets. The writing systems of many languages, such as Arabic, Hindi, and
Japanese, have character sets that are not differentiated by case. Languages that do
differentiate by case contain exceptions. In German, for example, the lowercase
character ß has traditionally had no single-character uppercase equivalent. Instead,
the character ß is converted to the two uppercase characters SS.
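
The ß case shows that uppercasing can change the length of a string, so a
character-by-character conversion is not enough. The following is a minimal sketch,
not a complete German casing routine. The function name upper_german() is
hypothetical, and the code assumes that wchar_t values are Unicode code points, which
holds on many platforms but is not guaranteed by the C standard.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide string, expanding sharp s to "SS".
 * Assumes wchar_t holds Unicode code points.
 * Returns the number of characters written (excluding the terminator),
 * or (size_t)-1 if the output buffer of `cap` elements is too small. */
size_t upper_german(const wchar_t *in, wchar_t *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;

    for (; *in != L'\0'; ++in) {
        if (*in == 0x00DF) {               /* U+00DF LATIN SMALL LETTER SHARP S */
            if (n + 2 >= cap)              /* two letters plus the terminator */
                return (size_t)-1;
            out[n++] = L'S';
            out[n++] = L'S';
        } else {
            if (n + 1 >= cap)              /* one letter plus the terminator */
                return (size_t)-1;
            out[n++] = (wchar_t)towupper((wint_t)*in);
        }
    }
    out[n] = L'\0';
    return n;
}
```

Because the output can be longer than the input, the caller supplies a separate
output buffer instead of converting in place.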

New Character Classifications

Most programming languages do not provide character classes for languages that are
not Latin-based. Phonetic and ideographic writing systems are two examples of systems
whose characters do not fit the ASCII classes. For languages like Hindi and Thai,
classes must differentiate between vowels and consonants.
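
As an illustration of a vowel and consonant classification, the sketch below sorts
Devanagari code points (the script used to write Hindi) into classes. It is a
simplified assumption, not a complete classifier: it covers only the core ranges of
the Unicode Devanagari block, and the names deva_class and deva_classify() are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical classes for Devanagari code points. */
enum deva_class {
    DEVA_OTHER,        /* anything not covered by the ranges below */
    DEVA_VOWEL,        /* independent vowel letters */
    DEVA_CONSONANT,    /* consonant letters */
    DEVA_VOWEL_SIGN    /* dependent vowel signs attached to a consonant */
};

/* Classify a Unicode code point using the core Devanagari ranges only.
 * Later additions to the block (for example U+0958..U+095F) are
 * deliberately left out of this sketch and fall into DEVA_OTHER. */
enum deva_class deva_classify(uint32_t cp)
{
    if (cp >= 0x0904 && cp <= 0x0914)
        return DEVA_VOWEL;
    if (cp >= 0x0915 && cp <= 0x0939)
        return DEVA_CONSONANT;
    if (cp >= 0x093E && cp <= 0x094C)
        return DEVA_VOWEL_SIGN;
    return DEVA_OTHER;
}
```

For instance, deva_classify(0x0915) classifies the letter KA as a consonant, while
any ASCII letter falls into DEVA_OTHER.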

Classification Functions

Programs often process characters based on their classification. Internationalized
character classification functions are locale-dependent and help programmers avoid
hard-coding the characters that belong to a given class. Some programming languages
provide classification features to support internationalization. The C programming
language, for example, includes the isalpha() function to determine whether a
character is alphabetic. Instead of comparing a character to hard-coded characters in
the ASCII code set, a program calls isalpha() to determine whether the character
belongs to the alphabet class of the current locale. When the program is enabled for
a new locale, the same isalpha() call determines whether a character belongs to the
alphabet class of that locale.
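
The contrast between a hard-coded comparison and a call to isalpha() can be sketched
as follows. The function names is_ascii_letter() and count_alpha() are hypothetical
and exist only for this illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hard-coded test: recognizes only the 52 unaccented Latin letters,
 * and assumes an ASCII-compatible code set. */
int is_ascii_letter(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Locale-aware count: isalpha() consults the LC_CTYPE category of the
 * current locale. The cast to unsigned char is required because passing
 * a negative char value to isalpha() is undefined behavior. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (isalpha((unsigned char)*s))
            ++n;
    }
    return n;
}
```

In the default "C" locale the two approaches agree. They diverge only after the
program selects another locale with setlocale().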

For example, in a program enabled for the US English locale, isalpha() returns false
(zero) because the character í is not a valid member of the US English alphabet. If
the same program is enabled for Spanish, isalpha() returns true (a nonzero value)
because the character í is a valid member of the Spanish alphabet.
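
A program can observe this difference by switching the LC_CTYPE category before
calling isalpha(). The sketch below is written under several assumptions: the helper
name is_alpha_in_locale() is hypothetical, locale names such as "es_ES.ISO8859-1" are
platform-specific and might not be installed, and the character í is taken to be the
single byte 0xED, as in ISO 8859-1.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: report whether byte `c` is alphabetic in the named
 * locale. Returns 1 or 0, or -1 if the locale is not available.
 * The previous LC_CTYPE setting is restored before returning. */
int is_alpha_in_locale(const char *locale_name, unsigned char c)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int result;

    /* Copy the current name: a later setlocale() call may overwrite it. */
    snprintf(saved, sizeof saved, "%s", current ? current : "C");

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                     /* locale not installed or not known */

    result = isalpha(c) != 0;          /* normalize nonzero to 1 */
    setlocale(LC_CTYPE, saved);
    return result;
}
```

In the "C" locale, is_alpha_in_locale("C", 0xED) returns 0, because that locale
treats only the unaccented Latin letters as alphabetic. With an installed Spanish
ISO 8859-1 locale, the same byte would be expected to classify as alphabetic.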

The C programming language also has routines that perform class conversions. For
example, tolower() converts uppercase characters to lowercase characters. Most
existing C-type functions, however, provide support only for the ASCII code set.
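
A class conversion with tolower() might be applied to a whole string as in the
following sketch. The function name lower_in_place() is hypothetical. The code
handles single-byte code sets only, which reflects the limitation described above.

```c
#include <ctype.h>

/* Hypothetical helper: convert a single-byte string to lowercase in place.
 * tolower() uses the LC_CTYPE category of the current locale and returns
 * its argument unchanged when no lowercase mapping exists. */
void lower_in_place(char *s)
{
    for (; *s != '\0'; ++s) {
        /* Cast to unsigned char: negative char values are not valid input. */
        *s = (char)tolower((unsigned char)*s);
    }
}
```

Because each byte maps to exactly one byte, this approach cannot express
conversions that change the length of the text, such as ß to SS.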