HP StorageWorks Reference Information Storage System V1.5 User Guide (T3559-96045, November 2008)

Word characters and separators
Word characters include all uppercase and lowercase letters, digits, and the following additional
characters:
_ (underscore)
# (number/pound/hash sign)
& (ampersand)
All other characters are separators (except, in queries, the wildcards ? and * and the special query
characters ~, ", -, and !).
However, && by itself is not a word. It is a Boolean operator. When combined with at least one more
word character, && can be part of a word. For example, a&&b is a word.
Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are
treated the same.
Regular expression definition of English word characters
The following regular expression provides, in succinct form, a complete specification of English word
characters (except for treatment of && as a non-word):
[A-Za-z0-9_#&]+
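The following Python sketch illustrates how the rules above split English text into words. It is an illustration only, not the RISS implementation. The function name split_words is hypothetical, and lowercasing each word is an assumption used here to model case-insensitive matching. The sketch does not handle the wildcards and special query characters that apply in queries.

```python
import re

# Word characters per the regular expression above: letters, digits, _, #, and &.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")


def split_words(text):
    """Return the words in text, treating every other character as a separator.

    Hypothetical helper for illustration; not part of RISS.
    """
    words = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        if token == "&&":
            # && by itself is a Boolean operator, not a word.
            continue
        # Assumption: lowercase to model case-insensitive indexing and queries.
        words.append(token.lower())
    return words


print(split_words("R&D budget_2008 #42 a&&b && done"))
# ['r&d', 'budget_2008', '#42', 'a&&b', 'done']
```

In this example, a&&b is kept as a single word, while the standalone && is dropped because it is an operator.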
Letters and digits in different character sets
Topics include:
Letters and digits defined, page 48
Letters and digits in files, page 48
Letters and digits defined
All letters and digits are word characters. What RISS considers a letter or digit depends on the character
set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z).
For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included.
Most ideographic characters, such as those used in Asian languages, are also considered letters.
Whatever the language and encoding used for a particular document (file or email message), RISS maps
encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if
a given character is a letter or a digit (or neither):
A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter),
Lu (uppercase letter), Lt (titlecase letter), Lm (modifier letter), or Lo (other letter).
A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not
in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
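The letter and digit rules above can be sketched in Python with the standard unicodedata module. This is an illustration only, not the RISS implementation. The function names is_letter and is_digit are hypothetical, and Python uses the Unicode version bundled with the interpreter rather than Unicode 2.0, so results for characters added or reclassified after Unicode 2.0 might differ from RISS.

```python
import unicodedata

# Unicode general categories that the text above treats as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}


def is_letter(ch):
    """Return True if the single character ch is in a letter category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """Return True if the name of ch contains the word DIGIT and ch is
    outside the excluded range \\u2000 through \\u2FFF."""
    if "\u2000" <= ch <= "\u2fff":
        return False
    # Unnamed characters get an empty name and are therefore not digits.
    name = unicodedata.name(ch, "")
    return "DIGIT" in name.split()


for ch in ["a", "\u00c9", "\u4e2d", "7", "\u0663", "\u2460", "_"]:
    print(repr(ch), is_letter(ch), is_digit(ch))
```

For example, \u0663 (ARABIC-INDIC DIGIT THREE) is classified as a digit, while \u2460 (CIRCLED DIGIT ONE) is not, because it falls inside the excluded range.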
Letters and digits in files
Although all letters and digits are word characters, their treatment in files (including email message
attachments) depends on the character encoding used. You can search for any words in email message
bodies and headers, regardless of the encoding.
You can search for words in files (including email body, header, attachments, and indexed documents)
provided the character encoding is one of the following: