HP StorageWorks Reference Information Storage System V1.4 User Guide (T3559-96028, December 2005)

Word characters and separators
Word characters include all uppercase and lowercase letters, digits, and the following additional
characters:
_ (underscore)
# (number/pound/hash sign)
& (ampersand)
All other characters are separators (except, in queries, the wildcards ? and * and the special query
characters ~, ", -, and !).
However, && by itself is not a word. It is a Boolean operator. When combined with at least one more
word character, && can be part of a word. For example, a&&b is a word.
Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are
treated the same.
Regular expression definition of English word characters
The following regular expression provides, in succinct form, a complete specification of English word
characters (except for treatment of && as a non-word):
[A-Za-z0-9_#&]+
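The rules above can be illustrated with a minimal Python sketch. It is not RISS code: the function name tokenize is hypothetical, and lowercasing is only one assumed way to model the statement that uppercase and lowercase letters are treated the same. The sketch covers plain English text only and ignores wildcards and the special query characters.

```python
import re

# Word characters per the guide: English letters, digits, '_', '#', and '&'.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")

def tokenize(text):
    """Return the words in text, lowercased (hypothetical helper).

    Every character outside the class above acts as a separator.
    A bare '&&' is skipped because the guide treats it as a Boolean
    operator, not a word.
    """
    words = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        if token == "&&":
            continue  # operator, not a word
        words.append(token.lower())  # assumed model of case-insensitivity
    return words

print(tokenize("Ticket #42: R&D a&&b && user_name@example.com"))
# ['ticket', '#42', 'r&d', 'a&&b', 'user_name', 'example', 'com']
```

In this sketch the colon, the @ sign, and the period act as separators, so the email address is split into three words, while a&&b stays one word.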
Letters and digits in different character sets
Topics include:
Letters and digits defined, page 72
Letters and digits in files, page 72
Letters and digits defined
All letters and digits are word characters. What RISS considers a letter or digit depends on the character
set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z).
For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included.
Most ideographic characters, such as those used in Asian languages, are also considered letters.
Whatever the language and encoding used for a particular document (file or email message), RISS maps
encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if
a given character is a letter or a digit (or neither):
A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter),
Lu (uppercase letter), Lt (titlecase letter), Lm (modifier letter), or Lo (other letter).
A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not
in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
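The letter and digit definitions above can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions, not the RISS implementation: the function names are hypothetical, and unicodedata uses the Unicode version bundled with the Python interpreter rather than Unicode 2.0, so results may differ for characters added or reclassified since then.

```python
import unicodedata

# General categories that the guide counts as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch falls in one of the letter categories listed above."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_digit(ch):
    """True if the Unicode name of ch contains the word DIGIT and ch
    lies outside the excluded range U+2000 through U+2FFF."""
    if 0x2000 <= ord(ch) <= 0x2FFF:
        return False
    # name() returns "" for unnamed characters instead of raising.
    return "DIGIT" in unicodedata.name(ch, "").split()

def is_word_char(ch):
    """Letters, digits, and the three additional word characters."""
    return is_letter(ch) or is_digit(ch) or ch in "_#&"

for ch in ["a", "\u00e9", "\u4e2d", "7", "\u0663", "\u2460", "_", "-"]:
    print(repr(ch), is_letter(ch), is_digit(ch), is_word_char(ch))
```

Under these assumptions, U+0663 (ARABIC-INDIC DIGIT THREE) counts as a digit, while U+2460 (CIRCLED DIGIT ONE) does not, because it lies in the excluded range.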
Letters and digits in files
Although all letters and digits are word characters, their treatment in files (including email message
attachments) depends on the character encoding used. You can search for any words in email message
bodies and headers, regardless of the encoding.
You can search for words in files (including email body, header, attachments, and indexed documents)
provided the character encoding is one of the following: