DataLoader/MX Reference Manual (G06.24+)

Running DataLoader/MX
DataLoader/MX Reference Manual—525872-002
Considerations—DataLoader/MX Process
where mydef has been previously added as a DEFINE.
Using the -P Parameter
The -P parameter is rarely used in a production data load, but it is very useful when
you initially set up a database. Suppose that you have a large amount of data, perhaps
spanning hundreds of tapes, and you know that it contains 500,000,000 records. You
want to partition this data over ten disks but have only a rough idea of the key values.
To divide the data into ten even partitions of 50,000,000 records each, you must know
the exact key values at the partition boundaries.
-P can provide these values. First run DataLoader/MX to get the boundaries, and then
set up your database with the partition boundaries you have determined.
For example, you could create a version of DataLoader/MX, named db10, that replaces
the default BUILDKEY with one that examines each record and determines its key. You
would then run it with this command:
$ db10 -I=infile -P=10
DataLoader/MX displays the partition boundaries in a report when it completes.
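Conceptually, finding the boundaries for N partitions amounts to sorting the keys and
reading off the key at each 1/N position. The Python sketch below illustrates only that
idea. It is not DataLoader/MX's implementation, and the function name
partition_boundaries and the sample keys are hypothetical.

```python
def partition_boundaries(keys, partitions):
    """Return the (partitions - 1) keys that begin partitions 2..N.

    Illustrative only: sorts every key in memory, which is exactly what
    becomes impractical for very large inputs.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    ordered = sorted(keys)
    n = len(ordered)
    if n == 0:
        return []
    # Boundary i is the first key of partition i + 1, at position i * n / N.
    return [ordered[(i * n) // partitions] for i in range(1, partitions)]


# Hypothetical input: 1,000 distinct keys (1..1000) arriving in reverse order.
keys = list(range(1000, 0, -1))
bounds = partition_boundaries(keys, 10)
print(bounds)  # [101, 201, 301, 401, 501, 601, 701, 801, 901]
```

Each of the ten resulting partitions holds 100 keys. The same rule applied to
500,000,000 records would give partitions of 50,000,000 records each.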
DataLoader/MX can help even when you have a very large amount of data. Suppose that
you have too much data to sort, even if you sort only the keys. You can use the %
modifier to randomly sample the data. One option is to randomly sample from the
beginning of the data by using the parameters in this command:
$ dataload -I="infile(text,1%,MAX=100000)" -P=10
Random sampling selects 1% of the records and stops when 100,000 records have
been selected. At a 1% rate, this means that DataLoader/MX reads roughly the first
10,000,000 records. DataLoader/MX can easily sort the key values for this many
sampled records.
Use this solution if you know that the key values are uniformly distributed throughout
the file. DataLoader/MX looks at less data, but the partition boundaries it gives you are
still very accurate.
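To see how a sampling percentage and a MAX limit interact, the following Python sketch
models sampling as an independent random choice per record. That model is an
assumption made for illustration, because this manual does not describe how
DataLoader/MX selects records internally. The names sample_records, rate, and
max_samples are hypothetical.

```python
import random


def sample_records(records, rate, max_samples=None, seed=None):
    """Keep each record with probability `rate`; stop once `max_samples` are kept.

    Returns (sample, records_read) so the caller can see how much of the
    input was actually examined before sampling stopped.
    """
    rng = random.Random(seed)
    sample = []
    records_read = 0
    for record in records:
        records_read += 1
        if rng.random() < rate:
            sample.append(record)
            if max_samples is not None and len(sample) >= max_samples:
                break  # limit reached: the rest of the input is never read
    return sample, records_read


# Scaled-down stand-in for the example: 1,000,000 records, 1% rate, limit 1,000.
sample, records_read = sample_records(range(1_000_000), 0.01,
                                      max_samples=1000, seed=1)
print(len(sample), records_read)
```

In this scaled-down run, sampling stops after roughly 100,000 of the 1,000,000
records, so about 90% of the input is never examined. That is harmless when the keys
are spread uniformly through the file.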
However, if the key values are not uniformly distributed over the entire file, or if you do
not know how the key values are distributed, the boundary values that DataLoader/MX
produces in this situation might not be accurate, because the sample covers only the
beginning of the file. You get more accurate values if you lower the sampling
percentage to 0.1% and omit MAX, so that DataLoader/MX is forced to sample the
whole file. For 500,000,000 records, this yields a sample of about 500,000 keys. Get
the boundaries for the sample with this command:
$ dataload -I="infile(text,0.1%)" -P=10
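The effect of input ordering on the two sampling strategies can be illustrated with a
small Python simulation. It is a sketch under stated assumptions: the input is a
hypothetical file of 1,000,000 keys stored in ascending order (a worst case for sampling
from the beginning), sampling is modeled as an independent random choice per record,
and the helper name boundaries is invented here. It does not reproduce
DataLoader/MX's own algorithm.

```python
import random


def boundaries(sample, partitions):
    """Keys at the 1/N, 2/N, ... positions of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // partitions]
            for i in range(1, partitions)]


rng = random.Random(7)
file_keys = range(1_000_000)  # hypothetical file, keys in ascending order

# Strategy 1: 1% rate, stop after 1,000 samples (only the file's start is read).
early = []
for key in file_keys:
    if rng.random() < 0.01:
        early.append(key)
        if len(early) == 1000:
            break

# Strategy 2: 0.1% rate, no limit (the whole file is read).
whole = [key for key in file_keys if rng.random() < 0.001]

early_bounds = boundaries(early, 10)
whole_bounds = boundaries(whole, 10)
print("largest early-stop boundary:", max(early_bounds))
print("middle whole-file boundary: ", whole_bounds[4])
```

Both samples hold about 1,000 keys, but every early-stop boundary falls within
roughly the first tenth of the key range, while the whole-file boundaries are spread
across the full range.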
DataLoader/MX uses FastSort, so you can add the =_SORT_DEFAULTS DEFINE before
running DataLoader/MX to specify sorting parameters.