HP StorageWorks Reference Information Storage System V1.1 User Guide (February 2005)

Chapter 6: Query syntax and matching
Matching similar words
Fuzzy words
You can search for document words that are textually similar to a given literal
query word (that is, one that contains no wildcards) by appending a tilde (~)
character to the word, creating a fuzzy word. For example, the fuzzy word
define~ matches similar words such as defined and definite (but not defining,
definition, indefinite, or pine). It will also match define itself.
How word similarity is measured
Note: This section provides an in-depth explanation of how word similarity is
measured. In most cases, you do not need to be concerned with just how similar
two words must be in order to match. However, when interpreting the results of
complex queries, this information can help you better understand why you
obtain the results you do.
The edit distance (also called Levenshtein distance) between two words is the
number of single-character operations needed to change one into the other,
where an operation is a deletion, replacement, or insertion.
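
The definition above can be sketched in a few lines of Python. This is only an
illustration of the standard dynamic-programming way to compute Levenshtein
distance; the function name levenshtein is chosen here for the example, and
nothing in it is taken from the search engine's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character deletions, replacements,
    or insertions needed to turn a into b."""
    # At the start of row i, previous[j] is the distance between the
    # first i-1 characters of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca by cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("define", "defined"))  # 1 (insert d)
print(levenshtein("define", "pine"))     # 3 (delete d, delete e, replace f by p)
```

Because the function always returns the smallest possible number of
operations, a hand count that follows a longer route can come out higher than
the value it reports.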
For example, the edit distance between define and pine is three: two deletions
(de) and one replacement (f by p). The distance between define and definite is
two (i and t inserted before the final e).
So, why does the search engine consider define so much more similar to
definite than to pine, when the edit distances differ by only one? Because the
edit distance (number of character changes) is compared to the word length (of
the shorter of the query and document words). Two words are closer, for
purposes of querying, if it takes less to change one into the other, relative
to their lengths.
The similarity ratio used by the search engine is d/min(query, doc), where d
is the edit distance, min is a function that returns the lesser of its
arguments, and query and doc are the lengths of the query word and document
word, respectively. A fuzzy word matches a document word if this ratio is no
more than 0.5.
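
As a rough illustration of the rule above, the ratio test might be written as
follows. The names similarity_ratio and fuzzy_matches are hypothetical, the
edit distance is passed in rather than computed, and the inclusive comparison
(<=) is an assumption taken from the wording "no more than 0.5". A word whose
ratio is exactly 0.5 sits on the boundary, and the real engine may treat it
differently.

```python
def similarity_ratio(distance: int, query: str, doc: str) -> float:
    """d / min(query, doc): edit distance relative to the shorter word."""
    return distance / min(len(query), len(doc))


def fuzzy_matches(distance: int, query: str, doc: str,
                  threshold: float = 0.5) -> bool:
    """True if the ratio is no more than the threshold (assumed inclusive)."""
    return similarity_ratio(distance, query, doc) <= threshold


# define vs. pine: distance 3, shorter word has 4 letters, ratio 0.75
print(fuzzy_matches(3, "define", "pine"))     # False

# define vs. defined: distance 1, shorter word has 6 letters, ratio about 0.17
print(fuzzy_matches(1, "define", "defined"))  # True
```

Dividing by the shorter length makes the test stricter for short words: three
changes are too many against a four-letter word such as pine, while the same
count against a longer pair of words yields a smaller ratio.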
Examples: