HP Reference Information Storage System Version 1.6 User Guide revision 2 (T3559-90810, August 2007)

Word characters and separators

Word characters include all uppercase and lowercase letters, digits, and the following additional characters:

• _ (underscore)

• # (number/pound/hash sign)

• & (ampersand)

All other characters are separators (except, in queries, the wildcards ? and * and the special query characters ~, ", -, and !).

However, && by itself is not a word. It is a Boolean operator. When combined with at least one more word character, && can be part of a word. For example, a&&b is a word.

Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are treated the same.

Regular expression definition of English word characters

The following regular expression provides, in succinct form, a complete specification of English word characters (except for treatment of && as a non-word):

[A-Za-z0-9_#&]+
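As an illustration only, the expression above can be tried out in a few lines of Python. This is a minimal sketch, not RISS code: the function name split_words is hypothetical, and the lowercasing step is an assumption that stands in for the case-insensitive behavior described earlier. A bare && is dropped because it is a Boolean operator rather than a word.

```python
import re

# Word characters per the guide: ASCII letters, digits, underscore, #, and &.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")


def split_words(text):
    """Split text into word tokens; every other character acts as a separator.

    Hypothetical helper for illustration; not part of RISS.
    """
    tokens = WORD_RE.findall(text)
    # A bare "&&" is the Boolean operator, not a word, so it is dropped.
    # Lowercasing is an assumed way to model case-insensitive matching.
    return [t.lower() for t in tokens if t != "&&"]


if __name__ == "__main__":
    print(split_words("R&D budget_2007 #42 a&&b && x"))
```

In this sketch, R&D, budget_2007, #42, and a&&b each survive as a single token, while the standalone && between a&&b and x is discarded.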

Letters and digits in different character sets

Topics include:

• Letters and digits defined

• Letters and digits in files

Letters and digits defined

All letters and digits are word characters. What RISS considers a letter or digit depends on the character set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included. Most ideographic characters, such as those used in Asian languages, are also considered letters.

Whatever the language and encoding used for a particular document (file or email message), RISS maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if a given character is a letter or a digit (or neither):

• A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter), Lu (uppercase letter), Lt (titlecase letter), Lm (modifier letter), or Lo (other letter).

• A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
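The two rules above can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions: the function names is_letter and is_digit are hypothetical, and unicodedata follows the Unicode version bundled with the Python interpreter rather than Unicode 2.0, so results for characters added or reclassified after 2.0 may differ from what RISS does.

```python
import unicodedata

# Unicode general categories that the guide treats as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}


def is_letter(ch):
    """True if ch falls in one of the letter categories listed above."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """True if the Unicode name contains the word DIGIT and the code
    point lies outside the excluded range U+2000 through U+2FFF."""
    name = unicodedata.name(ch, "")  # "" for characters without a name
    in_excluded_range = 0x2000 <= ord(ch) <= 0x2FFF
    return "DIGIT" in name.split() and not in_excluded_range


if __name__ == "__main__":
    # Decode Latin-1 bytes to Unicode first, then classify each character.
    text = b"caf\xe9 7".decode("iso-8859-1")
    for ch in text:
        print(repr(ch), is_letter(ch), is_digit(ch))
```

Under these assumptions, an accented letter such as é and an ideograph such as 中 count as letters, 5 and the Arabic-Indic digit three (U+0663) count as digits, and the circled digit one (U+2460) does not, because it falls inside the excluded range.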

Letters and digits in files

Although all letters and digits are word characters, their treatment in files (including email message attachments) depends on the character encoding used. You can search for any words in email message bodies and headers, regardless of the encoding.

You can search for words in files (including email body, header, attachments, and indexed documents) provided the character encoding is one of the following:

Query expression syntax and matching