2022.1

Extracting data of variable length
In PDF and Text files, transactional data isn't structured uniformly, as it is in a CSV, database or
XML file. Data can be located anywhere on a page. Therefore, data is extracted from a
certain region on the page. However, the data can be spread over multiple lines and multiple
pages:
• Line items may continue on the next page, separated from the line items on the first page
  by a page break, a number of empty lines and a letterhead.
• Data may vary in length: a product description, for example, may or may not fit on one line.
How to exclude lines from an extraction is explained in another topic: "Extracting transactional
data" on page 263 (see From a PDF or Text file).
This topic explains a few ways to extract a variable number of lines.
Text file: setting the height to 0
If the variable part in a TXT file is at the end of the record (for example, the body of an email), the
height of the region to extract can be set to 0. This instructs the DataMapper to extract all lines
starting from the current position in a record until the end of the record, and store them in a
single field.
This also works with the data.extract() method in a script; see "extract()" on page428.
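As a rough illustration of what a height of 0 does in a script, the sketch below uses a small stand-in for the data object so that it can run outside the DataMapper. The createMockData helper, the sample record, the 1-based column positions and the argument order (left, right, verticalOffset, lineHeight, separator) are assumptions made for this example; check the extract() topic for the actual signature.

```javascript
// Hypothetical stand-in for the DataMapper's data object, for illustration only.
// It models a text record as an array of lines plus a current line position.
function createMockData(lines, currentLine) {
  return {
    // Assumed argument order: left and right are 1-based column positions,
    // verticalOffset is relative to the current line, lineHeight is a line count.
    extract: function (left, right, verticalOffset, lineHeight, separator) {
      var start = currentLine + verticalOffset;
      // A lineHeight of 0 means: take every line up to the end of the record.
      var end = lineHeight === 0 ? lines.length : start + lineHeight;
      return lines
        .slice(start, end)
        .map(function (line) { return line.substring(left - 1, right).trim(); })
        .join(separator);
    }
  };
}

// Sample record (made up): an email whose body has a variable number of lines.
var record = [
  "From: alice@example.com",
  "Subject: Order update",
  "",
  "Hello,",
  "Your order has shipped.",
  "Regards"
];

var data = createMockData(record, 0);

// Columns 1 to 80, starting 3 lines below the current line, height 0:
// everything from the first body line to the end of the record.
var body = data.extract(1, 80, 3, 0, "\n");

// For comparison, a height of 1 returns a single line.
var subject = data.extract(1, 80, 1, 1, "\n");
```

In this sketch, body holds the three body lines joined by line breaks, however long the email is, while subject holds only the second line of the record.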
Finding a condition
Where it isn't possible to use a setting to extract data of variable length, the key is to find one or
more differences between lines that reveal the size of the region from which data needs to
be extracted.
Whilst, for example, a product description may extend over two lines, other data - such as the
unit price - will never be longer than one line. Either the area above or the one below the unit
price will be empty when the product description covers two lines.
Such a difference can then be used as a condition in a Condition step or a Case in a Multiple
Conditions step.
A Condition step, as well as each Case in a Multiple Conditions step, can only check for one
condition. To combine conditions, you would need a script.
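A combined check of this kind might look like the following sketch. It is plain JavaScript working on an array of lines so that it can be run on its own; the function names, the column positions and the sample lines are all hypothetical, and in a real data mapping configuration the text would come from the data object instead.

```javascript
// Hypothetical helper: returns the trimmed text in a column range of one line,
// or an empty string when that line does not exist.
function region(lines, index, left, right) {
  var line = lines[index] || "";
  return line.substring(left - 1, right).trim();
}

// Decides how many lines a line item occupies by combining two conditions:
// the next line has text in the description area AND nothing in the
// unit price area.
function lineItemHeight(lines, index, desc, price) {
  var nextHasDescription = region(lines, index + 1, desc.left, desc.right) !== "";
  var nextPriceIsEmpty = region(lines, index + 1, price.left, price.right) === "";
  return nextHasDescription && nextPriceIsEmpty ? 2 : 1;
}

// Sample data (made up): the first description wraps onto a second line.
var lines = [
  "Garden hose, reinforced,        12.50",
  "25 m, with brass couplings",
  "Watering can                     8.00"
];
var desc = { left: 1, right: 30 };   // assumed description columns
var price = { left: 31, right: 40 }; // assumed unit price columns

var first = lineItemHeight(lines, 0, desc, price);  // item starting on line 1
var second = lineItemHeight(lines, 2, desc, price); // item starting on line 3
```

Here first evaluates to 2 and second to 1. Checking both areas guards against mistaking the start of the next line item, which also has text in the description area, for a continuation line.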