SQL/MX 3.2.1 Query Guide (H06.26+, J06.15+)

Keeping Statistics Current
HP NonStop SQL/MX Release 3.2.1 Query Guide663851-003
3-7
Performance Issues and Accuracy in Sampling
From SQL/MX 2.3.2 onwards, it is also possible to push the sampling operation down
into DP2. As a result, the number of read and discarded records reduces, thereby
improving the performance of the command. This technique is effective for low
sampling percentages. Use the ALLOW_DP2_ROW_SAMPLING default attribute to
control the sampling. The attribute values are SYSTEM, ON, and OFF. The default
value is SYSTEM. When the attribute is set to SYSTEM, UPDATE STATISTICS
pushes the sampling down into DP2 for sampling percentages up to 5%. When it is set
to ON, UPDATE STATISTICS pushes the sampling down into DP2 for sampling
percentages up to 50%. When set to OFF, the sampling is not pushed down into DP2.
When you use the sampling option, you can increase the efficiency by specifying a row
count for the columns you want to update. The next statement provides a way to obtain
the row count for the EMPNUM column of the EMPLOYEE table:
SELECT COUNT(*) FROM EMPLOYEE;
Use the value returned from running the SELECT COUNT (*) statement (125000) to
specify the valid row count in the UPDATE STATISTICS statement:
UPDATE STATISTICS FOR TABLE EMPLOYEE ON (EMPNUM)
SAMPLE 5000 ROWS SET ROWCOUNT 125000
Performance Issues and Accuracy in Sampling
Certain sampling options provide more accurate statistics than others, but higher
accuracy can mean performance trade-offs. When you need high accuracy or quality in
your sample, use random row sampling. In this method, the intermediate table being
sampled is scanned, and each row is selected with probability n/100, where n is the
sample percentage.
A faster alternative to random row sampling is cluster sampling. Use this method when
performance is a high priority, or when the quality of a sample is determined to be good
enough, based on the knowledge of the table for which statistics are being updated.
In cluster sampling, only disk blocks that are actually part of the result set are read and
processed. If the sampling percentage is small, which is typically the case, the
performance advantage of cluster sampling over other sampling methods can be
dramatic. For example, consider two queries, one that selects a 1 PERCENT row
sample of a table and another that selects a 1 PERCENT cluster sample of the same
table. Because cluster sampling requires reading and processing only approximately 1
percent of the table, and row sampling reads and processes the entire table, cluster
sampling can be up to 100 times faster than random row sampling. However, the trade-
off for this performance is less accuracy than random row sampling provides.
Other sampling methods, such as PERIODIC, do not provide additional performance or
accuracy but provide semantics that are appropriate in certain situations. Use
PERIODIC sampling (for example, choosing 1 of every 10 rows) when it is important to
compute statistics from all parts of a table. For example, if a table is ordered by
timestamp and column values vary widely over time, the use of PERIODIC sampling
ensures that statistics are computed from rows spread evenly throughout the entire
table, resulting in accurate statistics.
TP663851.fm Page 7 Wednesday, January 30, 2013 5:37 PM