2022.1


Extracting data of variable length

In PDF and Text files, transactional data isn't structured as uniformly as in a CSV, database or XML file. Data can be located anywhere on a page, so it is extracted from a certain region on the page. However, the data can be spread over multiple lines and multiple pages:

• Line items may continue on the next page, separated from the line items on the first page by a page break, a number of empty lines and a letterhead.

• Data may vary in length: a product description, for example, may or may not fit on one line.

How to exclude lines from an extraction is explained in another topic: "Extracting transactional data" on page 263 (see From a PDF or Text file).

This topic explains a few ways to extract a variable number of lines.

Text file: setting the height to 0

If the variable part in a TXT file is at the end of the record (for example, the body of an email), the height of the region to extract can be set to 0. This instructs the DataMapper to extract all lines starting from the current position in a record until the end of the record, and store them in a single field.

This also works with the data.extract() method in a script; see "extract()" on page 428.
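
The effect of a height of 0 can be illustrated with a small, self-contained JavaScript sketch. It does not call the DataMapper API: extractLines and the sample record are hypothetical names that only model the rule described above, where a height of 0 means "everything from the current line to the end of the record".

```javascript
// Hypothetical model of a Text-file record: an array of lines.
// This is NOT the DataMapper API; it only mimics the "height 0" rule.
function extractLines(recordLines, startLine, height) {
  // A height of 0 means: take every line from startLine to the end of the record.
  const end = height === 0 ? recordLines.length : startLine + height;
  return recordLines.slice(startLine, end).join("\n");
}

// Illustrative sample data (an email-like record).
const record = [
  "From: jane@example.com",
  "Subject: Order 1001",
  "Hello,",
  "Please find my order attached.",
  "Regards, Jane"
];

// Fixed height of 1: only the subject line is extracted.
const subject = extractLines(record, 1, 1);

// Height 0: the variable-length body, however many lines it has,
// ends up in a single value.
const body = extractLines(record, 2, 0);
```

Because the end of the record acts as the boundary, the same call keeps working when the body grows or shrinks from one record to the next.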

Finding a condition

Where no setting can be used to extract data of variable length, the key is to find one or more differences between lines that make clear how big the region is from which the data needs to be extracted.

Whilst, for example, a product description may extend over two lines, other data, such as the unit price, will never be longer than one line. Either the area above or the one below the unit price will be empty when the product description covers two lines.

Such a difference can then be used as a condition in a Condition step or a Case in a Multiple Conditions step.
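
As a rough illustration of such a condition, the sketch below checks whether the unit price area on the line below an item is empty. It is plain JavaScript, not DataMapper API code. The column positions, the helper names regionText and priceAreaBelowIsEmpty, and the sample lines are all assumptions made for this example.

```javascript
// Hypothetical helper: the trimmed text between two column positions on a line.
function regionText(line, left, right) {
  return (line || "").substring(left, right).trim();
}

// Assumed layout (illustrative only): description in columns 0-30,
// unit price in columns 30-40.
const PRICE = { left: 30, right: 40 };

// The single condition: is the unit price area on the line below empty?
function priceAreaBelowIsEmpty(lines, index) {
  return regionText(lines[index + 1], PRICE.left, PRICE.right) === "";
}

// Illustrative sample data: the first item has a two-line description.
const lines = [
  "Garden hose, 25 m, reinforced".padEnd(30) + "19.95",
  "with brass couplings",
  "Hose clamp".padEnd(30) + "1.50",
  "Spray nozzle".padEnd(30) + "7.25"
];

// The price area below the first item is empty, so its description
// covers two lines.
const firstItemSpansTwoLines = priceAreaBelowIsEmpty(lines, 0);

// The line below "Hose clamp" has a price, so it starts a new item.
const clampSpansTwoLines = priceAreaBelowIsEmpty(lines, 2);
```

The result of the check tells how high the region for the description should be: two lines when the price area below is empty, one line otherwise.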

A Condition step, as well as each Case in a Multiple Conditions step, can only check for one condition. To combine conditions, you would need a script.
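
In a script, several checks can be joined with logical operators. The following is a minimal sketch in plain JavaScript that shows the idea only. It does not use the DataMapper API, and the column positions, helper names and sample lines are hypothetical. It combines two conditions: the unit price area on the next line is empty, and the description area on that line is not.

```javascript
// Hypothetical helper: the trimmed text between two column positions on a line.
function regionText(line, left, right) {
  return (line || "").substring(left, right).trim();
}

// Assumed layout (illustrative only).
const DESCRIPTION = { left: 0, right: 30 };
const UNIT_PRICE = { left: 30, right: 40 };

// Two conditions combined with a logical AND:
// 1. the unit price area on the next line is empty, and
// 2. the description area on the next line is not empty.
function descriptionContinues(lines, index) {
  const next = lines[index + 1];
  const priceEmpty =
    regionText(next, UNIT_PRICE.left, UNIT_PRICE.right) === "";
  const descriptionPresent =
    regionText(next, DESCRIPTION.left, DESCRIPTION.right) !== "";
  return priceEmpty && descriptionPresent;
}

// Illustrative sample data: the last item is followed by a blank line.
const items = [
  "Garden hose, 25 m, reinforced".padEnd(30) + "19.95",
  "with brass couplings",
  "Spray nozzle".padEnd(30) + "7.25",
  ""
];

// The second line holds more description and no price.
const hoseContinues = descriptionContinues(items, 0);

// The line after "Spray nozzle" is blank: the price area is empty,
// but so is the description area, so nothing continues.
const nozzleContinues = descriptionContinues(items, 2);
```

Checking the price area alone would wrongly treat the blank line after the last item as a continuation. The second condition rules that out.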
