HP IAP Version 2.1 User Guide, March 2011

Topics include:

• Word characters and separators, page 52

• Regular expression definition of English word characters, page 52

Word characters and separators

Word characters include all uppercase and lowercase letters, digits, and the following additional characters:

• _ (underscore)

• # (number/pound/hash sign)

• & (ampersand)

All other characters are separators. The only exceptions apply in queries: the wildcards ? and *, and the special query characters ~, ", -, and !.

However, && by itself is not a word; it is a Boolean operator. When combined with at least one more word character, && can be part of a word. For example, a&&b is a word.

Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are treated the same.

Regular expression definition of English word characters

The following regular expression provides, in succinct form, a complete specification of English word characters (except for treatment of && as a non-word):

[A-Za-z0-9_#&]+
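The rules above can be sketched in a few lines of Python. This is only an illustration, not IAP's implementation: the function name tokenize is hypothetical, the character class is assumed to contain exactly the letters, digits, _, #, and &, and the sketch covers document text only (query wildcards and special query characters are not handled).

```python
import re

# Assumed transcription of the guide's character class for English word characters.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")


def tokenize(text):
    """Split text into words; every character outside the class is a separator.

    tokenize is a hypothetical helper, not an IAP function.
    """
    words = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        # A bare && is a Boolean operator, not a word.
        if token == "&&":
            continue
        # Indexing and query analysis are not case-sensitive, so fold case.
        words.append(token.lower())
    return words


if __name__ == "__main__":
    print(tokenize("Order #42: a&&b && R&D_team"))
```

In this sketch the colon and the spaces act as separators, a&&b survives as a single word, and the standalone && is dropped.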

Letters and digits in different character sets

Topics include:

• Letters and digits defined, page 52

• Letters and digits in files, page 53

Letters and digits defined

All letters and digits are word characters. What IAP considers a letter or digit depends on the character set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included. Most ideographic characters, such as those used in Asian languages, are also considered letters.

Whatever the language and encoding used for a particular document (file or email message), IAP maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if a given character is a letter or a digit (or neither):

• A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter), Lu (uppercase letter), Lt (title case letter), Lm (modifier letter), or Lo (other letter).

• A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
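The two rules above can be approximated with Python's standard unicodedata module. This is a hedged sketch, not IAP's code: the helper names is_letter, is_digit, and is_word_char are hypothetical, and Python ships a much newer Unicode database than Unicode 2.0, so results for characters added after that version may differ from IAP's.

```python
import unicodedata

# General categories that the guide treats as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}

# Additional word characters listed in the guide.
EXTRA_WORD_CHARS = {"_", "#", "&"}


def is_letter(ch):
    """True if ch falls in one of the letter categories (hypothetical helper)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """True if the Unicode name contains the word DIGIT and ch is outside
    the excluded range \\u2000 through \\u2FFF (hypothetical helper)."""
    if 0x2000 <= ord(ch) <= 0x2FFF:
        return False
    # Unnamed code points yield an empty name and are never digits.
    name = unicodedata.name(ch, "")
    return "DIGIT" in name.split()


def is_word_char(ch):
    """True for letters, digits, and the characters _, #, and &."""
    return is_letter(ch) or is_digit(ch) or ch in EXTRA_WORD_CHARS


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u4e2d", "7", "\u0663", "\u2460", "-"]:
        print(repr(ch), is_letter(ch), is_digit(ch), is_word_char(ch))
```

Under these assumptions, CIRCLED DIGIT ONE (\u2460) is rejected because it lies inside the excluded range, while ARABIC-INDIC DIGIT THREE (\u0663) and FULLWIDTH DIGIT ONE (\uFF11) count as digits.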
