HP Software Reference Information Storage System v1.6 User Guide (T3559-90810, November 2008)

Word characters and separators

Word characters include all uppercase and lowercase letters, digits, and the following additional characters:

• _ (underscore)

• # (number/pound/hash sign)

• & (ampersand)

All other characters are separators (except, in queries, the wildcards ? and * and the special query characters ~, ", -, and !).

However, && by itself is not a word. It is a Boolean operator. When combined with at least one more word character, && can be part of a word. For example, a&&b is a word.

Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are treated the same.

Regular expression definition of English word characters

The following regular expression provides, in succinct form, a complete specification of English word characters (except for treatment of && as a non-word):

[A-Za-z0-9_#&]+
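The word rules above can be sketched in a few lines of Python. This is only an illustration, not RISS code: the function name `tokenize` is hypothetical, the sketch covers English word characters in document text only, and it assumes the spaces inside the printed brackets are a typesetting artifact rather than part of the character class. Query wildcards and special query characters are not handled.

```python
import re

# English word characters as given by the guide's regular expression.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")


def tokenize(text):
    """Split text into words; every non-word character acts as a separator.

    Hypothetical helper for illustration only.
    """
    words = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        # A bare && is a Boolean operator, not a word.
        if token == "&&":
            continue
        # Indexing is not case-sensitive, so fold to lowercase.
        words.append(token.lower())
    return words


print(tokenize("Foo_bar #1 && a&&b R&D!"))
```

Here `a&&b` survives as a single word because && is combined with other word characters, while the standalone && is dropped.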

Letters and digits in different character sets

Topics include:

• Letters and digits defined

• Letters and digits in files

Letters and digits defined

All letters and digits are word characters. What RISS considers a letter or digit depends on the character set encoding used. For US-ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included. Most ideographic characters, such as those used in Asian languages, are also considered letters.

Whatever the language and encoding used for a particular document (file or email message), RISS maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if a given character is a letter or a digit (or neither):

• A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter), Lu (uppercase letter), Lt (titlecase letter), Lm (modifier letter), or Lo (other letter).

• A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
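The two rules above can be approximated with Python's standard `unicodedata` module. Treat this as a rough sketch rather than a reproduction of RISS behavior: Python ships a much newer Unicode database than Unicode 2.0, so results may differ for characters added or reclassified since then, and the helper names `is_letter` and `is_digit` are hypothetical.

```python
import unicodedata

# Unicode general categories that the guide counts as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}


def is_letter(ch):
    """True if ch falls in one of the five letter categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """True if the character name contains the word DIGIT and the
    code point lies outside the excluded range U+2000 through U+2FFF."""
    if 0x2000 <= ord(ch) <= 0x2FFF:
        return False
    # Unnamed code points fall back to an empty name.
    name = unicodedata.name(ch, "")
    return "DIGIT" in name.split()


for ch in ["a", "\u00e9", "\u4e2d", "7", "\u0663", "\u2460", "_"]:
    print(f"U+{ord(ch):04X} letter={is_letter(ch)} digit={is_digit(ch)}")
```

Under these assumptions, U+0663 (ARABIC-INDIC DIGIT THREE) counts as a digit, while U+2460 (CIRCLED DIGIT ONE) does not, because it lies inside the excluded range.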

Letters and digits in files

Although all letters and digits are word characters, their treatment in files (including email message attachments) depends on the character encoding used. You can search for any words in email message bodies and headers, regardless of the encoding.

You can search for words in files (including email body, header, attachments, and indexed documents) provided the character encoding is one of the following:

Query expression syntax and matching