DataLoader/MX Reference Manual (G06.24+)

Running DataLoader/MX

DataLoader/MX Reference Manual—525872-002

Considerations—DataLoader/MX Process

where mydef has been previously added as a DEFINE.

Using the -P Parameter

The -P parameter is rarely used in a production data load, but it is very useful when you initially set up a database. Suppose that you have a large amount of data, perhaps spanning hundreds of tapes, and you know that it contains 500,000,000 records. You want to partition this data over ten disks but have only a rough idea of the key values. To divide the data into ten even partitions of 50,000,000 records each, you must know the exact key values at the partition boundaries.

-P can provide these values. First run DataLoader/MX to get the boundaries, and then set up your database with the partition boundaries you have determined.

For example, you could create a version of DataLoader/MX named db10 that replaces the default BUILDKEY with one that examines each record and determines its key, and then run it with this command:

$ db10 -I=infile -P=10

DataLoader/MX displays the partition boundaries in a report when it completes.
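
As an illustration of what evenly sized partition boundaries mean, the following Python sketch sorts a set of keys and picks the keys that split them into equal parts. It is not DataLoader/MX code and does not claim to reproduce its report. The function name partition_boundaries and the convention that each boundary is the first key of the next partition are assumptions made for this example.

```python
def partition_boundaries(keys, partitions):
    """Return partitions - 1 boundary keys that split keys into even parts.

    Assumption for this sketch: each boundary is the first key of the
    next partition, so 10 partitions produce 9 boundary values.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    ordered = sorted(keys)
    n = len(ordered)
    if n == 0:
        return []
    # Boundary i sits i/partitions of the way through the sorted keys.
    return [ordered[(i * n) // partitions] for i in range(1, partitions)]


# Hypothetical keys K0000 through K0999, split into 10 partitions.
keys = [f"K{n:04d}" for n in range(1000)]
boundaries = partition_boundaries(keys, 10)
print(boundaries)
```

With 1,000 evenly spaced keys and 10 partitions, each partition holds 100 keys, which mirrors the 50,000,000-record partitions in the scenario above at a much smaller scale.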

Even if you have a very large amount of data, DataLoader/MX can help. Suppose that you have too much data to sort, even if you sort only the keys. You can use the % modifier to sample the data randomly. One option is to sample randomly from the beginning of the data by using the parameters in this command:

$ dataload -I="infile(text,1%,MAX=100000)" -P=10

Random sampling selects 1% of the records and stops when 100,000 records have been selected. DataLoader/MX can easily sort the key values for this many records. Use this solution if you know that the key values are uniformly distributed throughout the file. DataLoader/MX looks at less data, but because the sample is representative of the whole file, the partition boundaries it gives you are still very accurate.
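
The effect of combining a sampling percentage with MAX can be sketched in Python. The sketch assumes that each record is selected independently with the given probability, which is an assumption about how random sampling behaves, not a description of DataLoader/MX internals. The function name sample_with_cap and the small record counts are hypothetical.

```python
import random


def sample_with_cap(records, percent, max_records, rng):
    """Select each record with probability percent/100 and stop reading
    as soon as max_records records have been selected."""
    selected = []
    records_read = 0
    for record in records:
        records_read += 1
        if rng.random() < percent / 100.0:
            selected.append(record)
            if len(selected) >= max_records:
                break  # cap reached: the rest of the input is never read
    return selected, records_read


# Scaled-down stand-in for the input: 1,000,000 records, 1% sampling,
# capped at 1,000 selected records.
rng = random.Random(0)  # fixed seed so that repeated runs match
selected, records_read = sample_with_cap(range(1_000_000), 1.0, 1_000, rng)
print(len(selected), "records selected after reading", records_read)
```

Under the same assumption, a 1% sample capped at 100,000 records stops after reading about 100,000 / 0.01 = 10,000,000 records on average, which is only the first 2% of a 500,000,000-record file.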

However, if the key values are not uniformly distributed over the entire file, or if you do not know how the key values are distributed, the boundary values that DataLoader/MX produces from a sample of only the beginning of the file are not reliable. You get more accurate values if you lower the sampling percentage to 0.1% and omit MAX so that DataLoader/MX is forced to sample the whole file. Get the boundaries for the sample with this command:

$ dataload -I="infile(text,0.1%)" -P=10
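
The difference between sampling only the beginning of a file and sampling the whole file can be sketched in Python. The sketch uses a hypothetical worst case in which the file is already in key order, and it assumes independent per-record selection. The function names boundaries and sample are invented for this example and are not part of DataLoader/MX.

```python
import random


def boundaries(keys, partitions):
    """Return partitions - 1 keys that split the sorted keys evenly."""
    ordered = sorted(keys)
    n = len(ordered)
    return [ordered[(i * n) // partitions] for i in range(1, partitions)]


def sample(records, percent, rng, max_records=None):
    """Select each record with probability percent/100; stop early only
    if max_records is given and that many records have been selected."""
    picked = []
    for record in records:
        if rng.random() < percent / 100.0:
            picked.append(record)
            if max_records is not None and len(picked) >= max_records:
                break
    return picked


# Hypothetical worst case: 1,000,000 records stored in key order.
file_keys = range(1_000_000)
rng = random.Random(0)  # fixed seed so that repeated runs match

head = sample(file_keys, 1.0, rng, max_records=1_000)  # stops early
whole = sample(file_keys, 0.1, rng)                    # reads everything

head_bounds = boundaries(head, 10)
whole_bounds = boundaries(whole, 10)
print("last boundary from the capped 1% sample:  ", head_bounds[-1])
print("last boundary from the uncapped 0.1% sample:", whole_bounds[-1])
```

In this sketch the capped sample never sees keys beyond the early part of the file, so all of its boundaries cluster at the low end of the key range, while the uncapped 0.1% sample yields boundaries spread over the full range from a similar number of records.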

DataLoader/MX uses FastSort, so you can add the =_SORT_DEFAULTS DEFINE before you run DataLoader/MX to specify sorting parameters.