HP IAP Version 2.1 User Guide, March 2011
Topics include:
• Word characters and separators, page 52
• Regular expression definition of English word characters, page 52
Word characters and separators
Word characters include all uppercase and lowercase letters, digits, and the following additional
characters:
• _ (underscore)
• # (number/pound/hash sign)
• & (ampersand)
All other characters are separators. The only exceptions apply in queries, where the wildcards ? and *
and the special query characters ~, ", -, and ! have their own meanings.
However, && by itself is not a word. It is a Boolean operator. When combined with at least one more
word character, && can be part of a word. For example, a&&b is a word.
Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are
treated the same.
Regular expression definition of English word characters
The following regular expression provides, in succinct form, a complete specification of English word
characters (except for treatment of && as a non-word):
[A-Za-z0-9_#&]+
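The rules above can be sketched in Python as follows. This is only an illustration, not IAP's actual implementation: the function name english_words is hypothetical, the character class is assumed to contain no literal spaces, and the handling of special query characters is not modeled (the sketch treats the input as document text).

```python
import re

# Character class from the guide, assumed to contain no literal spaces.
# It covers English (ASCII) word characters only.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")

def english_words(text):
    """Return the word tokens in text, lowercased (hypothetical helper).

    Every character outside the class acts as a separator.
    """
    tokens = WORD_RE.findall(text)
    # "&&" by itself is a Boolean operator, not a word, so drop it.
    # Lowercasing mirrors the case-insensitive indexing described above.
    return [t.lower() for t in tokens if t != "&&"]

print(english_words("R&D budget_2011 && a&&b, item#42!"))
# ['r&d', 'budget_2011', 'a&&b', 'item#42']
```

Note that a&&b survives as a single word because the && is combined with other word characters, while the standalone && is discarded.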
Letters and digits in different character sets
Topics include:
• Letters and digits defined, page 52
• Letters and digits in files, page 53
Letters and digits defined
All letters and digits are word characters. What IAP considers a letter or digit depends on the character
set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z,
a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are
included. Most ideographic characters, such as those used in Asian languages, are also considered
letters.
Whatever the language and encoding used for a particular document (file or email message), IAP
maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to
determine if a given character is a letter or a digit (or neither):
• A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter),
Lu (uppercase letter), Lt (title case letter), Lm (modifier letter), or Lo (other letter).
• A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is
not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
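The letter and digit definitions above can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions, not IAP's implementation: Python ships a much newer Unicode database than Unicode 2.0, so results may differ for characters added or reclassified since then; "contains the word DIGIT" is interpreted here as DIGIT appearing as a whitespace-delimited word in the character name; and the function names are hypothetical.

```python
import unicodedata

# Unicode general categories that the guide counts as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch falls in one of the letter categories listed above."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_digit(ch):
    """True if the Unicode name of ch contains the word DIGIT and ch
    lies outside the excluded range \\u2000 through \\u2FFF."""
    if "\u2000" <= ch <= "\u2fff":
        return False
    # name() returns the default "" for characters without a name.
    return "DIGIT" in unicodedata.name(ch, "").split()

def is_word_char(ch):
    """Letters, digits, and the three additional word characters."""
    return is_letter(ch) or is_digit(ch) or ch in "_#&"

# Bytes in a legacy encoding are first mapped to Unicode, then classified.
decoded = b"caf\xe9".decode("latin-1")
print(all(is_word_char(c) for c in decoded))
```

For example, is_digit("\u2460") (CIRCLED DIGIT ONE) is false because the character lies in the excluded range, while is_digit("\uff11") (FULLWIDTH DIGIT ONE) is true.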