SQL/MX Data Mining Guide

Preparing the Data

HP NonStop SQL/MX Data Mining Guide—523737-001

Loading the Data

The first step in preparing a data set for mining is loading the data into database tables.

Suppose the credit card organization has a customer data warehouse that stores the customer data and the account history data. In a typical real-world scenario, the warehouse could hold millions of records representing millions of customers and dating back many years.

Creating the Database

Suppose a data mining database is created that consists of the Customers table and the Account History table described in the previous section.

You can use the DDL scripts included with this manual to create a database to run the examples in this manual. To create the database:

1. Open the .pdf file for this manual.

2. Navigate to Appendix A, Creating the Data Mining Database, which contains the DDL script that creates the database.

3. On the tool bar, select the Table/Formatted Text Select Tool.

4. Copy and paste from the DDL script, one page at a time, into an OSS text file.

5. Within MXCI (the SQL/MX conversational interface), use the OBEY command to run the OSS file you created.

Importing Data Into the Database

After the data mining database is created, the warehouse data is imported into the database. In a typical real-world scenario, you would import the data by using a database utility. For example, you can use the DataLoader/MP utility to import a large quantity of data into an SQL/MP database. For further information, see the DataLoader/MX Reference Manual and the discussion of the Import Utility in the SQL/MX Reference Manual.

Alternatively, you can use INSERT statements to insert values into the data mining database. The INSERT statements for the example in this manual are included in Appendix B, Inserting Into the Data Mining Database.
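The INSERT approach can be sketched as follows. This is a minimal illustration only: it uses the Python sqlite3 module as a stand-in for an SQL/MX session so that it runs anywhere, and the table name customers, its columns (account, marital_status, income), and the row values are all hypothetical. The actual table definitions and INSERT statements are those in Appendix A and Appendix B.

```python
import sqlite3

# Hypothetical, simplified stand-in for the Customers table. The real
# definition is in Appendix A and is not reproduced here.
DDL = """
CREATE TABLE customers (
    account        INTEGER PRIMARY KEY,
    marital_status CHAR(10),
    income         NUMERIC
)
"""

# Illustrative rows only; these are not the values from Appendix B.
ROWS = [
    (1001, "single", 42000),
    (1002, "married", 58000),
    (1003, "married", 61000),
]


def load_customers(conn):
    """Create the table, insert each row, and return the resulting row count."""
    conn.execute(DDL)
    # One parameterized INSERT statement is executed per row.
    conn.executemany(
        "INSERT INTO customers (account, marital_status, income) "
        "VALUES (?, ?, ?)",
        ROWS,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load_customers(connection))
    connection.close()
```

Row-by-row INSERT statements are convenient for a small example database. For warehouse-scale volumes, a bulk import utility of the kind described above is usually the better fit.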

Profiling the Data

Profiling often begins with the computation of basic information about each attribute. For discrete attributes, this basic information is typically a table of the unique values and a count of how many times each value occurs. However, as the cardinality of an attribute increases, these frequencies become less meaningful. For continuous attributes, the usual approach is to compute summary metrics such as minimum, maximum, mean, and variance.
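Both kinds of profile can be expressed as simple aggregate queries. The sketch below is an illustration under stated assumptions, not the method used later in this manual: it runs plain SQL through the Python sqlite3 module as a stand-in for an SQL/MX session, and the customers table, its columns (marital_status, income), and the sample rows are hypothetical. Because SQLite has no built-in variance aggregate, the population variance is computed as the mean of the squares minus the square of the mean; consult the SQL/MX Reference Manual for the aggregates available in your environment.

```python
import sqlite3

# Hypothetical sample data: (marital_status, income). Not from the manual.
SAMPLE = [
    ("single", 30000.0),
    ("married", 50000.0),
    ("married", 70000.0),
    ("divorced", 50000.0),
]


def make_table(conn):
    """Create and populate a small illustrative customers table."""
    conn.execute("CREATE TABLE customers (marital_status TEXT, income REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", SAMPLE)


def profile_discrete(conn):
    """Return each unique value of a discrete attribute with its count."""
    sql = """
        SELECT marital_status, COUNT(*)
        FROM customers
        GROUP BY marital_status
        ORDER BY marital_status
    """
    return conn.execute(sql).fetchall()


def profile_continuous(conn):
    """Return (minimum, maximum, mean, population variance) of income."""
    # Population variance = E[x^2] - (E[x])^2, built from AVG alone.
    sql = """
        SELECT MIN(income),
               MAX(income),
               AVG(income),
               AVG(income * income) - AVG(income) * AVG(income)
        FROM customers
    """
    return conn.execute(sql).fetchone()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    make_table(connection)
    print(profile_discrete(connection))
    print(profile_continuous(connection))
    connection.close()
```

With the sample rows, the discrete profile lists three unique values, with married occurring twice, and the continuous profile reports a minimum of 30000, a maximum of 70000, and a mean of 50000. The one-pass variance formula is compact but can lose precision when values are large relative to their spread, so a two-pass computation may be preferable on real data.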