HP IAP Version 2.1 User Guide, March 2011

Topics include:
Word characters and separators
Regular expression definition of English word characters
Word characters and separators
Word characters include all uppercase and lowercase letters, digits, and the following additional
characters:
_ (underscore)
# (number/pound/hash sign)
& (ampersand)
All other characters are separators. The only exceptions apply in queries, where the wildcards ? and *
and the special query characters ~, ", -, and ! are not treated as separators.
However, && by itself is not a word. It is a Boolean operator. When combined with at least one more
word character, && can be part of a word. For example, a&&b is a word.
Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are
treated the same.
Regular expression definition of English word characters
The following regular expression provides, in succinct form, a complete specification of English word
characters (except for treatment of && as a non-word):
[A-Za-z0-9_#&]+
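As an illustration only, the following Python sketch applies the rules above to English text. The function name tokenize is hypothetical and this is not IAP's implementation. The character class is written without internal spaces, because a space is a separator. The sketch handles document text only: it treats the query wildcards and special query characters as ordinary separators.

```python
import re

# Word characters per the guide: ASCII letters, digits, underscore, # and &.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")


def tokenize(text):
    """Return the words in text, lowercased (hypothetical helper, not IAP code).

    A bare '&&' is skipped because the guide treats it as a Boolean
    operator rather than a word. Everything the pattern does not match
    acts as a separator.
    """
    words = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        if token == "&&":
            continue  # Boolean operator, not a word
        words.append(token.lower())  # indexing is not case-sensitive
    return words


if __name__ == "__main__":
    print(tokenize("Tom && Jerry a&&b C#_1 x-y"))
```

In this sketch, a&&b survives as a single word because the && is joined to other word characters, while the standalone && is dropped.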
Letters and digits in different character sets
Topics include:
Letters and digits defined
Letters and digits in files
Letters and digits defined
All letters and digits are word characters. What IAP considers a letter or digit depends on the character
set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z,
a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are
included. Most ideographic characters, such as those used in Asian languages, are also considered
letters.
Whatever the language and encoding used for a particular document (file or email message), IAP
maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to
determine if a given character is a letter or a digit (or neither):
A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter),
Lu (uppercase letter), Lt (title case letter), Lm (modifier letter), or Lo (other letter).
A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is
not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
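The two definitions above can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions, not IAP's code: the helper names is_letter and is_digit are hypothetical, and Python ships a much newer Unicode database than Unicode 2.0, so characters added or reclassified since then may be classified differently than in IAP.

```python
import unicodedata

# General categories that the guide counts as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}


def is_letter(ch):
    """True if ch falls in one of the letter categories (hypothetical helper)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """True if the Unicode name of ch contains the word DIGIT and ch lies
    outside the excluded range U+2000 through U+2FFF (hypothetical helper)."""
    if 0x2000 <= ord(ch) <= 0x2FFF:
        return False
    # Unnamed code points yield an empty name and therefore are not digits.
    name = unicodedata.name(ch, "")
    return "DIGIT" in name.split()


if __name__ == "__main__":
    # Map encoded bytes to Unicode first, then classify each character.
    text = b"caf\xe9 7".decode("latin-1")
    for ch in text:
        print(repr(ch), is_letter(ch), is_digit(ch))
```

Decoding the bytes first mirrors the described behavior of mapping each document's encoding to Unicode before deciding whether a character is a letter, a digit, or neither.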