HP StorageWorks Reference Information Storage System V1.4 User Guide (T3559-96028, December 2005)
• A ? matches any single character in a document word. For example, b??t matches beat, beet,
boat, blot, best, bust, bout, and so on.
• An * matches any sequence of characters in a document word, including a sequence of no
characters. For example, f*t matches the document words foot, feet, fit, fault, and ft; and f*
matches any document word beginning with f.
You can use any number of wildcard characters (* or ?) in a query word, but you cannot use a wildcard
at the beginning of a query word; doing so produces an error message. For example, *ion is not a valid query.
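The wildcard rules above can be illustrated with a short Python sketch. It is not the search engine's implementation; it simply translates a query word into a regular expression that follows the stated rules. The function names wildcard_to_regex and matches are hypothetical, and case-sensitive comparison is an assumption made for the example.

```python
import re


def wildcard_to_regex(query_word: str) -> "re.Pattern[str]":
    """Translate a query word containing ? and * into a compiled regex.

    Hypothetical helper: ? becomes "exactly one character" and * becomes
    "zero or more characters". A leading wildcard is rejected, mirroring
    the rule that a query word cannot begin with a wildcard.
    """
    if not query_word:
        raise ValueError("query word is empty")
    if query_word[0] in "*?":
        raise ValueError("a query word cannot begin with a wildcard")

    parts = []
    for ch in query_word:
        if ch == "?":
            parts.append(".")            # any single character
        elif ch == "*":
            parts.append(".*")           # any sequence, including none
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))


def matches(query_word: str, document_word: str) -> bool:
    """Return True if the whole document word matches the query word."""
    return wildcard_to_regex(query_word).fullmatch(document_word) is not None


if __name__ == "__main__":
    print(matches("b??t", "boat"))   # ? stands for exactly one character
    print(matches("f*t", "ft"))      # * may stand for no characters
    try:
        matches("*ion", "motion")
    except ValueError as err:
        print("error:", err)
```

The sketch uses fullmatch so that the pattern must cover the entire document word, which corresponds to matching a word rather than a fragment of it.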
Matching similar words
Topics include:
• Fuzzy words, page 74
• Measuring word similarity, page 74
Fuzzy words
You can search for document words that are textually similar to a given literal query word (that is, one
containing no wildcards). To do this, append a tilde (~) character to the word, creating a fuzzy word.
For example, the fuzzy word define~ matches the similar words defined and definite, but does not
match defining, definition, indefinite, or pine. It also matches define itself.
Measuring word similarity
The edit distance (also called Levenshtein distance) between two words is the number of single-character
operations (deletion, replacement, or insertion) required to change one word into the other word.
For example, the edit distance between define and pine is three: two deletions (d and e) and one
replacement (f by p). The distance between define and definite is also three (e replaced by i; te inserted).
The search engine considers define more similar to definite than to pine, even though the edit distances
are the same (three), because the edit distance (number of character changes) is compared to the word
length (of the shorter of the query and document words). Two words are closer, for querying purposes, if
it takes less to change one word into the other word relative to their lengths.
The similarity ratio used by the search engine is d/min(query, doc), where d is the edit distance, min is a
function that returns the lesser of its arguments, and query and doc are the lengths of the query word and
document word, respectively. A fuzzy word matches a document word if this ratio is no more than 0.5.
Examples:
Words Compared      Similarity Ratio             Match?
define, definite    3/min(6, 8) = 3/6 = 0.5      yes
define, pine        3/min(6, 4) = 3/4 = 0.75     no (0.75 > 0.5)
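As a rough illustration of the similarity ratio, the following Python sketch computes d/min(query, doc) and applies the 0.5 threshold. It is a sketch under stated assumptions, not the search engine's code: edit_distance is a textbook Levenshtein routine, and the names similarity_ratio and fuzzy_matches are hypothetical. A textbook Levenshtein computation returns 2 for define and definite (insert i and t before the final e), whereas the example above counts three operations, so the engine's own counting may differ from this routine. For that reason similarity_ratio accepts an optional distance argument that lets you supply a distance directly.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance (insert, delete, replace), row by row."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def similarity_ratio(query_word: str, document_word: str,
                     distance: Optional[int] = None) -> float:
    """Return d / min(len(query), len(doc)).

    If distance is None, it is computed with edit_distance, which is an
    assumption of this sketch and may not match the engine's own counts.
    """
    if not query_word or not document_word:
        raise ValueError("both words must be non-empty")
    if distance is None:
        distance = edit_distance(query_word, document_word)
    return distance / min(len(query_word), len(document_word))


def fuzzy_matches(query_word: str, document_word: str,
                  threshold: float = 0.5) -> bool:
    """A fuzzy word matches if the ratio is no more than the threshold."""
    return similarity_ratio(query_word, document_word) <= threshold


if __name__ == "__main__":
    print(similarity_ratio("define", "pine"))                  # 3/4
    print(similarity_ratio("define", "definite", distance=3))  # 3/6, as in the table
    print(fuzzy_matches("define", "pine"))
```

Dividing by the length of the shorter word means the same number of changes counts for more when one of the words is short, which is why pine falls outside the threshold while a longer word with the same distance can fall inside it.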
Matching word sequences
You can use word sequences to find documents with words that occur in a specified order and are
separated by a specified maximum distance.
Topics include:
• Simple word sequences, page 75
• Proximity word sequences, page 75