SQL/MX 3.2.1 Query Guide (H06.26+, J06.15+)

Keeping Statistics Current

HP NonStop SQL/MX Release 3.2.1 Query Guide—663851-003

3-7

Performance Issues and Accuracy in Sampling

From SQL/MX 2.3.2 onwards, it is also possible to push the sampling operation down

into DP2. As a result, the number of read and discarded records reduces, thereby

improving the performance of the command. This technique is effective for low

sampling percentages. Use the ALLOW_DP2_ROW_SAMPLING default attribute to

control the sampling. The attribute values are SYSTEM, ON, and OFF. The default

value is SYSTEM. When the attribute is set to SYSTEM, UPDATE STATISTICS

pushes the sampling down into DP2 for sampling percentages up to 5%. When it is set

to ON, UPDATE STATISTICS pushes the sampling down into DP2 for sampling

percentages up to 50%. When set to OFF, the sampling is not pushed down into DP2.

When you use the sampling option, you can increase the efficiency by specifying a row

count for the columns you want to update. The next statement provides a way to obtain

the row count for the EMPNUM column of the EMPLOYEE table:

SELECT COUNT(*) FROM EMPLOYEE;

Use the value returned from running the SELECT COUNT (*) statement (125000) to

specify the valid row count in the UPDATE STATISTICS statement:

UPDATE STATISTICS FOR TABLE EMPLOYEE ON (EMPNUM)

SAMPLE 5000 ROWS SET ROWCOUNT 125000

Performance Issues and Accuracy in Sampling

Certain sampling options provide more accurate statistics than others, but higher

accuracy can mean performance trade-offs. When you need high accuracy or quality in

your sample, use random row sampling. In this method, the intermediate table being

sampled is scanned, and each row is selected with probability n/100, where n is the

sample percentage.

A faster alternative to random row sampling is cluster sampling. Use this method when

performance is a high priority, or when the quality of a sample is determined to be good

enough, based on the knowledge of the table for which statistics are being updated.

In cluster sampling, only disk blocks that are actually part of the result set are read and

processed. If the sampling percentage is small, which is typically the case, the

performance advantage of cluster sampling over other sampling methods can be

dramatic. For example, consider two queries, one that selects a 1 PERCENT row

sample of a table and another that selects a 1 PERCENT cluster sample of the same

table. Because cluster sampling requires reading and processing only approximately 1

percent of the table, and row sampling reads and processes the entire table, cluster

sampling can be up to 100 times faster than random row sampling. However, the trade-

off for this performance is less accuracy than random row sampling provides.

Other sampling methods, such as PERIODIC, do not provide additional performance or

accuracy but provide semantics that are appropriate in certain situations. Use

PERIODIC sampling (for example, choosing 1 of every 10 rows) when it is important to

compute statistics from all parts of a table. For example, if a table is ordered by

timestamp and column values vary widely over time, the use of PERIODIC sampling

ensures that statistics are computed from rows spread evenly throughout the entire

table, resulting in accurate statistics.

TP663851.fm Page 7 Wednesday, January 30, 2013 5:37 PM