SQL/MX Data Mining Guide
Preparing the Data
HP NonStop SQL/MX Data Mining Guide—523737-001
Loading the Data
The first step in preparing a data set for mining is loading the data into database tables.
Suppose the credit card organization has a customer data warehouse that stores the
customer data and the account history data. In a typical real-world scenario, the
warehouse could hold millions of records, representing millions of customers and
dating back many years.
Creating the Database
Suppose you create a data mining database that consists of the Customers table and the
Account History table described in the previous section.
You can use the DDL scripts included with this manual to create a database to run the
examples in this manual. To create the database:
1. Open the .pdf file for this manual.
2. Navigate to Appendix A, Creating the Data Mining Database, which contains the
DDL script that creates the database.
3. On the tool bar, select the Table/Formatted Text Select Tool.
4. Copy and paste from the DDL script, one page at a time, into an OSS text file.
5. Within MXCI (the SQL/MX conversational interface), use the OBEY command to run
the OSS file you created.
Importing Data Into the Database
After the data mining database is created, the warehouse data is imported into the
database. In a typical real-world scenario, you would import the data by using a
database utility. For example, you can use the DataLoader/MX utility to import a large
quantity of data into an SQL/MX database. For further information, see the
DataLoader/MX Reference Manual, and see the SQL/MX Reference Manual for a discussion
of the Import Utility.
Alternatively, you can use INSERT statements to insert values into the data mining
database. The INSERT statements for the example in this manual are included in
Appendix B, Inserting Into the Data Mining Database.
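The INSERT approach can be sketched as follows. This is an illustration only, not the script from Appendix B: it uses Python's built-in sqlite3 module as a stand-in for SQL/MX so that it can be run anywhere, and the table definition, column names (account, marital_status, age), and row values are all hypothetical.

```python
import sqlite3

# SQLite stands in for SQL/MX here. The table layout, column names, and
# values are hypothetical and are not taken from Appendix A or Appendix B.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " account INTEGER PRIMARY KEY,"
    " marital_status VARCHAR(10),"
    " age INTEGER)"
)

rows = [
    (1234567, "Married", 38),
    (2400000, "Single", 25),
    (3000000, "Divorced", 47),
]

# One parameterized INSERT statement is executed per tuple; the "with"
# block commits the transaction if every insert succeeds.
with conn:
    conn.executemany(
        "INSERT INTO customers (account, marital_status, age) VALUES (?, ?, ?)",
        rows,
    )

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)
```

Row-by-row INSERT statements are convenient for a small example database. For warehouse-scale volumes, a bulk import utility such as the ones mentioned above is generally the better fit.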
Profiling the Data
Profiling often begins with the computation of basic information about each attribute.
For discrete attributes, this basic information is typically a table of the unique values
and a count of how many times each value occurs. However, as the cardinality (the
number of unique values) of an attribute increases, these frequencies become less
meaningful. For continuous attributes, the usual approach is to compute summary
statistics such as the minimum, maximum, mean, and variance.
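As a minimal sketch of both kinds of profile, the following example again uses Python's built-in sqlite3 module as a stand-in for SQL/MX. The customers table, its marital_status and age columns, and the sample rows are hypothetical. Because SQLite has no built-in variance aggregate, the population variance is computed here as the mean of the squares minus the square of the mean.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for SQL/MX.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (marital_status TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Married", 30), ("Single", 20), ("Married", 40),
     ("Single", 30), ("Married", 50)],
)

# Discrete attribute: each unique value and how many times it occurs.
frequencies = conn.execute(
    "SELECT marital_status, COUNT(*) FROM customers "
    "GROUP BY marital_status ORDER BY marital_status"
).fetchall()

# Continuous attribute: minimum, maximum, mean, and population variance
# (mean of the squares minus the square of the mean).
minimum, maximum, mean, variance = conn.execute(
    "SELECT MIN(age), MAX(age), AVG(age), "
    "AVG(age * age) - AVG(age) * AVG(age) FROM customers"
).fetchone()

print(frequencies)
print(minimum, maximum, mean, variance)
```

With a low-cardinality attribute such as marital status, the frequency table has only a few rows and is easy to read. Grouping on a high-cardinality attribute such as an account number would return roughly one row per record, which is why summary statistics are preferred for continuous attributes.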