HP Software Reference Information Storage System v1.6 User Guide (T3559-90810, November 2008)

Word characters and separators
Word characters include all uppercase and lowercase letters, digits, and the following additional
characters:
_ (underscore)
# (number/pound/hash sign)
& (ampersand)
All other characters are separators (except, in queries, the wildcards ? and * and the special query
characters ~, ", -, and !).
However, && by itself is not a word. It is a Boolean operator. When combined with at least one more
word character, && can be part of a word. For example, a&&b is a word.
Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are
treated the same.
Regular expression definition of English word characters
The following regular expression provides, in succinct form, a complete specification of English word
characters (except for treatment of && as a non-word):
[A-Za-z0-9_#&]+
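The word rules above can be illustrated with a minimal Python sketch. It is not RISS code. It assumes the character class contains no space character (spaces are separators), models case-insensitivity by lowercasing, and treats only a token that is exactly && as the Boolean operator. The function name english_words is hypothetical, and wildcards and special query characters are not handled.

```python
import re

# Character class from the guide, read as containing no space character.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")

def english_words(text):
    """Return lowercase word tokens; a bare '&&' is skipped as an operator.

    Assumption: lowercasing stands in for case-insensitive matching.
    """
    tokens = WORD_RE.findall(text)
    return [t.lower() for t in tokens if t != "&&"]

print(english_words("Sales&Marketing report_2008 #42"))
# ['sales&marketing', 'report_2008', '#42']
print(english_words("alpha && beta"))  # ['alpha', 'beta']
print(english_words("a&&b"))           # ['a&&b']
```

In this sketch a&&b survives as a single word because the && is joined to other word characters, while the standalone && between alpha and beta is dropped.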
Letters and digits in different character sets
Topics include:
Letters and digits defined, page 38
Letters and digits in files, page 38
Letters and digits defined
All letters and digits are word characters. What RISS considers a letter or digit depends on the character
set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z).
For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included.
Most ideographic characters, such as those used in Asian languages, are also considered letters.
Whatever the language and encoding used for a particular document (file or email message), RISS maps
encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if
a given character is a letter or a digit (or neither):
A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter),
Lu (uppercase letter), Lt (titlecase letter), Lm (modifier letter), or Lo (other letter).
A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not
in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
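The letter and digit definitions above can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions, not the RISS implementation. Python ships a later Unicode version than 2.0, so results may differ for characters added or reclassified since then. The function names are hypothetical, and "contains the word DIGIT" is read here as DIGIT appearing as a whole space-separated word in the character name.

```python
import unicodedata

# Unicode general categories that the guide lists as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch falls in one of the letter categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_digit(ch):
    """True if the character name contains the word DIGIT and the
    code point lies outside the excluded range U+2000..U+2FFF."""
    if 0x2000 <= ord(ch) <= 0x2FFF:
        return False
    # name() returns "" for unnamed characters because of the default.
    return "DIGIT" in unicodedata.name(ch, "").split()

def is_word_char(ch):
    """Letter, digit, or one of the additional word characters _ # &."""
    return is_letter(ch) or is_digit(ch) or ch in "_#&"

# Bytes in ISO 8859-1 are decoded to Unicode before classification.
text = b"caf\xe9".decode("iso-8859-1")
print(all(is_letter(c) for c in text))  # True
print(is_digit("\u0663"))  # True  (ARABIC-INDIC DIGIT THREE)
print(is_digit("\u2460"))  # False (CIRCLED DIGIT ONE, in excluded range)
```

Decoding first and classifying afterwards mirrors the described order: the encoding determines which characters can appear, and the Unicode properties determine whether each one is a letter or a digit.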
Letters and digits in files
Although all letters and digits are word characters, their treatment in files (including email message
attachments) depends on the character encoding used. You can search for any words in email message
bodies and headers, regardless of the encoding.
You can search for words in files (including email body, header, attachments, and indexed documents)
provided the character encoding is one of the following: