HP IAP Version 2.1 User Guide, March 2011

Proximity word sequences

You can use simple word sequences to search for words separated by separators but not by other
words. To search for document words that are in an ordered sequence, but might be separated by
other words, use a proximity word sequence.

To write a proximity word sequence, use the same syntax as a simple word sequence, but append a
tilde (~) character after the closing quote, and follow that with a numeric proximity value. The
proximity value represents the maximum number of other document words that can occur between any
two successive words of the sequence. A document matches a proximity word sequence if all words
occur in the document in the same order, with at most N intervening words between each successive
pair, where N is the proximity value.

For example, the sequence "bird garden stone"~3 matches any document that has these three
words in this order, with bird and garden separated by no more than three words, and garden and
stone separated by no more than three words. This sequence matches a document with the text
"a bird in the rose garden is near a stone" because exactly three words (in the rose) separate bird
from garden, and exactly three words (is near a) separate garden from stone. This sequence also
matches "a bird garden with a stone", where no words separate bird from garden and two words
(with a) separate garden from stone.
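
The matching rule described above can be sketched in a few lines of Python. This is only an
illustration of the rule as stated in this guide, not IAP's implementation: the function name
matches_proximity and its parameters are hypothetical, the document is assumed to be already
split into words, and words are compared exactly (case handling is not modeled).

```python
def matches_proximity(doc_words, seq_words, n):
    """Return True if seq_words occur in doc_words in order, with at most
    n other words between each successive pair of sequence words."""
    if not seq_words:
        return False  # assumption: an empty sequence matches nothing
    # Positions where the first sequence word occurs.
    reachable = {i for i, w in enumerate(doc_words) if w == seq_words[0]}
    for word in seq_words[1:]:
        # Keep positions of the next word that follow some reachable
        # position with at most n intervening words (gap of at most n + 1).
        reachable = {
            j for j, w in enumerate(doc_words)
            if w == word and any(0 < j - i <= n + 1 for i in reachable)
        }
    return bool(reachable)


doc1 = "a bird in the rose garden is near a stone".split()
doc2 = "a bird garden with a stone".split()
query = ["bird", "garden", "stone"]

print(matches_proximity(doc1, query, 3))  # True
print(matches_proximity(doc2, query, 3))  # True
print(matches_proximity(doc1, query, 2))  # False
print(matches_proximity(doc2, query, 0))  # False
```

The sketch tracks every position where the sequence matched so far rather than only the first
one, so a repeated word in the document cannot cause a valid match to be missed.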

Simple word sequences are a special case of proximity word sequences: ". . ." is the same as
". . ."~0. Any document found by ". . ."~N is also found by ". . ."~M whenever M > N.

Matching word sequences in files and email attachments

IAP renders files and email attachments (like spreadsheets and PDF files) into text words. When
IAP renders a document, it follows the document application's internal representation of the file.
Certain file types, for example spreadsheets, look very different internally from how they look
externally. This means that the word sequence in the external application representation, which
the end user sees, may differ from the word sequence in the internal application representation.
IAP query matching uses the internal application representation.

Separators are ignored

IAP renders text into words. Remaining characters such as periods, commas, spaces, and newlines
are considered separators and are ignored. Phrase queries ignore all formatting elements and
non-word characters. For example, the following plain text:

“This was news to Mr. Smith. Johnson, however, knew better.”

matches the phrase query:

"Smith Johnson"

This is because internally, the two plain text sentences are represented as one continuous
sequence of words: “This was news to Mr Smith Johnson however knew better”.
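
A minimal Python sketch of this behavior follows. It is an assumption-laden illustration, not
IAP's tokenizer: the helper names to_words and contains_sequence are hypothetical, and a word is
assumed to be any run of ASCII letters or digits, with every other character treated as a
separator.

```python
import re


def to_words(text):
    # Assumption: a word is a run of letters or digits; periods, commas,
    # spaces, newlines, and all other characters act as separators.
    return re.findall(r"[A-Za-z0-9]+", text)


def contains_sequence(doc_words, seq_words):
    # A simple word sequence matches when its words appear consecutively.
    k = len(seq_words)
    return k > 0 and any(
        doc_words[i:i + k] == seq_words
        for i in range(len(doc_words) - k + 1)
    )


text = "This was news to Mr. Smith. Johnson, however, knew better."
words = to_words(text)

print(words)
# ['This', 'was', 'news', 'to', 'Mr', 'Smith', 'Johnson', 'however', 'knew', 'better']
print(contains_sequence(words, to_words("Smith Johnson")))  # True
```

Because the sentence boundary between Smith and Johnson is only a separator, the two words end up
adjacent in the word list and the phrase matches.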

Sequence is not intuitive

Internally, within the file's original application, a large multi-page document or a single-page
spreadsheet is represented as one long text sequence. Text may not appear in the same sequence
internally as it appears externally. Also, in certain file types, multiple instances of the same
text are represented as a single instance.
