SQL/MX Data Mining Guide

Preparing the Data
HP NonStop SQL/MX Data Mining Guide523737-001
2-5
Quick Profiling
NUMBER_CHILDREN ? 2 10
NUMBER_CHILDREN ? 3 3
--- 12 row(s) selected.
Because this query produces counts for four different attributes, use the ATTR column
to distinguish from which attribute the values are drawn. The C1 column contains the
values for the character attributes, and the C2 column contains the values for the
numeric attribute.
Example of Computing Statistics for Continuous Attributes
Similarly, a single query using TRANSPOSE can compute the necessary statistics for
all continuous attributes. The next query computes the minimum, maximum, mean, and
variance for the continuous attributes Customer Credit Limit and Balance, which are
both numeric:
SELECT attr, MIN(c1), MAX(c1), AVG(c1), VARIANCE(c1)
FROM acct_history
TRANSPOSE (1,cust_limit), (2,balance) AS (attr, c1)
GROUP BY attr
ORDER BY attr;
Sample results are:
Sample results are:
By using TRANSPOSE to compute attribute profiles, you gain performance and
scalability advantages. Performance is improved because the data set is scanned only
once. In addition, the number of queries is reduced to two: one for discrete attributes
and one for continuous attributes. Scalability is enhanced because the amount of data
accessed grows linearly with the number of attributes actually profiled.
Quick Profiling
The profiling step is highly iterative, because many different data sources are
inspected and evaluated for possible analysis. Getting a quick impression of an
attribute before proceeding to a more detailed profile is often necessary. For example,
by quickly estimating cardinality, you can determine whether to treat a column as
discrete or continuous. You can make a determination accurately without a scan of
every single data element.
Use the SQL/MX sampling feature to:
ATTR MIN(C1) MAX(C1) AVG(C1) VARIANCE(C1)
1 5000.00 40000.00 18225.81 2.01E+008
2 .00 32000.00 2539.12 1.46E+007
ATTR MIN(C1) MAX(C1) AVG(C1) VARIANCE(C1)
1 5000.00 40000.00 20139.86 2.35E+008
2 .00 32000.00 2444.17 1.584E+007