Real customer data
The tables ranged in size from approximately 10 million to 1 billion rows, and the UPDATE
STATISTICS sample sizes ranged from 0.1 to 20 percent.
For each data set and sample size, UPDATE STATISTICS was run both with and without the accuracy
enhancement. For each column tested, the estimated UEC was retrieved from the histogram tables and
recorded. The UEC estimates were then compared to the actual or true UEC for each column, and the
error was determined. For each test table, the results presented in this paper show the average UEC
estimation error for the columns on which statistics were updated.
The UEC estimation error for a column was calculated according to this formula:
Error = |estimatedUEC - actualUEC| / actualUEC * 100
For example, suppose this UPDATE STATISTICS statement is executed:
UPDATE STATISTICS FOR TABLE Employee ON (Salary) SAMPLE RANDOM 1 PERCENT;
As part of executing this statement, UPDATE STATISTICS stores the estimated UEC for the Salary
column in the histogram metadata tables. Assume for this example that the UEC estimated by
UPDATE STATISTICS from the 1 percent random sample is 45,000, while the actual UEC is 50,000.
The estimation error, in this case, is 10 percent (|45,000 - 50,000| / 50,000 * 100).
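For illustration only, the error calculation can be written as a short function. This is not SQL/MX code; the function name and the Python setting are our own, and the numbers are simply the Salary example from the text.

```python
def uec_estimation_error(estimated_uec: float, actual_uec: float) -> float:
    """Percentage error of a UEC estimate, per the formula above."""
    return abs(estimated_uec - actual_uec) / actual_uec * 100

# Salary example: a 1 percent sample estimated 45,000 distinct values
# against a true UEC of 50,000.
print(uec_estimation_error(45_000, 50_000))  # 10.0
```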
Test Results
The first test data set consisted of SQL/MX tables containing synthetic data drawn from Uniform,
Normal, and Zipf distributions. In the Uniform distribution, distinct values occur an equal number of
times. This distribution was included in the testing because it is well understood, and generating data
from it is straightforward. The Normal distribution is the well-known bell curve: values at or near the
mean (the average value) occur most frequently, and values occur progressively less often as their
distance from the mean increases. Many phenomena are approximately normally distributed, such as
human height and measures like reading ability and job satisfaction. The Zipf distribution is characterized by high
skew, where there are a few very large values and many smaller values. Examples of this distribution
are city populations, frequency of access to Web pages, and corporate revenue.
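The paper does not show how the synthetic columns were generated. As a rough illustration only, columns with these three shapes can be drawn with NumPy; the row count matches the tables described below, but the distribution parameters and value ranges here are arbitrary choices, not the ones used in the tests.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_rows = 10_000_000  # each test table held 10 million rows

# Uniform: every distinct value occurs roughly the same number of times.
uniform_col = rng.integers(0, 100_000, size=n_rows)

# Normal: values cluster around the mean and thin out toward the tails.
normal_col = rng.normal(loc=50_000, scale=10_000, size=n_rows).round().astype(int)

# Zipf: highly skewed; a few values dominate, many occur rarely.
zipf_col = rng.zipf(a=1.5, size=n_rows)

for name, col in [("uniform", uniform_col), ("normal", normal_col), ("zipf", zipf_col)]:
    print(name, "actual UEC:", len(np.unique(col)))
```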
Each of the test tables contained 10 million rows and three columns on which histograms were built.
Sample sizes from 1 to 20 percent were tested. Row-level sampling was used in all cases. UPDATE
STATISTICS also supports cluster sampling as an alternative to row-level sampling. In row-level
sampling, the table is scanned and a test is performed on each row to decide whether it should be
included in the sample; the outcome of this test depends only on the specified sample size, not on the
values in the row. In cluster sampling, a similar scan is done, but the test is performed on each cluster
(a contiguous set of disk blocks) rather than on individual rows. If a cluster is selected, all of its rows
are added to the sample. Cluster sampling is much faster than row sampling because clusters that are
not selected are skipped and never read from disk, so far fewer I/O operations are needed than with
row sampling. However, cluster sampling tends to produce less random samples, which generally
results in less accurate statistics.
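The following sketch contrasts the two schemes conceptually. It is not SQL/MX's implementation: the function names, the in-memory list standing in for a table, and the cluster size are invented for the illustration, and a contiguous run of rows stands in for a cluster of disk blocks.

```python
import random

def row_level_sample(rows, sample_fraction):
    """Row-level sampling: every row is examined, and each row is included
    independently with probability sample_fraction."""
    return [row for row in rows if random.random() < sample_fraction]

def cluster_sample(rows, sample_fraction, rows_per_cluster):
    """Cluster sampling: the include/skip decision is made once per cluster
    (a contiguous run of rows, standing in for contiguous disk blocks).
    A selected cluster contributes all of its rows; an unselected cluster
    would not be read from disk at all, which is where the I/O savings come from."""
    sample = []
    for start in range(0, len(rows), rows_per_cluster):
        if random.random() < sample_fraction:
            sample.extend(rows[start:start + rows_per_cluster])
    return sample

# Both draw roughly sample_fraction of the rows, but the cluster sample is
# made of contiguous runs, so it is less random and can skew UEC estimates.
rows = list(range(1_000_000))
print(len(row_level_sample(rows, 0.01)), len(cluster_sample(rows, 0.01, 200)))
```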
The test results for the first data set are shown in this table: