SQL/MX Release 2.0 Best Practices
Performance Tuning 46
The optimizer chooses specific access plans in subsequent steps of the query based on the number of
rows expected from earlier steps. When the predicted and actual row counts differ by as much as
several million, the later query steps often cannot deliver adequate performance. In cases of severe
data skew, the optimizer has little chance of producing a good plan: it might choose a plan that is
fast and efficient for small quantities of data but very expensive when a large result is returned
unexpectedly.
The customer corrected this problem relatively easily by explicitly setting the range of values in
the catalog to represent more accurately the volume of data that would be selected.
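One way to apply such a correction in SQL/MX is to regenerate the histogram statistics for the affected column so that the catalog reflects the real range of values. The table and column names below are illustrative, not the customer's:

```sql
-- Illustrative names; regenerating the histogram lets the optimizer see
-- the actual range and distribution of values in the column.
UPDATE STATISTICS FOR TABLE sales_fact ON (order_date);

-- Or refresh statistics for all columns of the table:
UPDATE STATISTICS FOR TABLE sales_fact ON EVERY COLUMN;
```

Refreshing statistics after each large load keeps the predicted and actual row counts from drifting apart again.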
Skewed data can also arise as data is loaded into the fact tables. Another customer had a table that
used a partition-number column as the partitioning key, over which data rows were distributed. The
original intent was to assign rows round-robin over the range of partition numbers during the monthly
load processing. The actual distribution of values, however, was far less balanced: some partitions
held several times more rows than others. As a result, the ESPs accessing a large range within one
partition delayed the processing for the entire query. The imbalance also wasted system capacity,
because only a few processors and disks worked on the query while the others became idle within a
short time.
The original design was based on distributing rows over a subset of data volumes, for example, one for
each processor. Other customers have used the same approach with fairly good results. The ideal way to
load the data is to distribute the data to be queried as evenly as possible across all processors, and
possibly across all partitions.
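As a sketch of that loading approach (the table, columns, and partition count here are assumptions, not the customer's schema), a round-robin partition number can be computed during the load so that rows spread evenly:

```sql
-- Assume sales_fact is partitioned on partition_num across 16 partitions
-- (0 through 15). Assigning MOD(row_seq, 16) spreads the incoming rows
-- evenly over all partitions instead of clustering them in a few.
INSERT INTO sales_fact (partition_num, order_id, sale_amount)
  SELECT MOD(s.row_seq, 16), s.order_id, s.sale_amount
  FROM staging_sales s;
```

Because the modulus cycles through every partition number, each partition receives roughly the same share of each load, regardless of the order in which the staging rows arrive.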
Opinions differ regarding whether the data should be distributed in striped sets (where each set consists
of one disk partition for each processor), or across all partitions for the table. One benefit of using all
partitions is that increasing the number of processors does not invalidate the original partitioning strategy.
Another advantage of using all partitions is that each partition is smaller than if the data were clustered
over a few partitions, so more of the partitions are likely to contain the requested data. Even if some
partitions are skipped because of skewed data, it is less likely that only a few ESPs will contribute to the
query.
Another common problem involving skewed data arises when the dimension data distribution is skewed
over the predicate column or columns. This problem is most apparent when the dimension is selected as
the outer table of a join: each ESP reads its partition to select qualifying rows, and if most rows are
clustered within a single partition, one ESP carries the majority of the work. The plan effectively
degrades into a serial plan.
The solution is either to index the dimension table (so that the optimizer selects a different plan) or to
distribute the dimension rows to eliminate the clustering effect.
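A minimal sketch of the indexing alternative (the table and column names are hypothetical): an index on the predicate column gives the optimizer an access path that does not depend on how the base-table rows are clustered across partitions.

```sql
-- Hypothetical dimension table and predicate column. With this index,
-- qualifying dimension rows can be located through the index rather than
-- by scanning the skewed base-table partitions.
CREATE INDEX ix_customer_region ON customer_dim (region);
```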
Addressing Problems in Large Tables
Large tables present unique challenges for performance optimization. These problems arise in several
ways:
• Access through non-key columns
• Inability to use MDAM query technology for access through a subset of the primary-key columns
• Frequent access to detail data that is aggregated
• Joins between large tables
The first two problems can be addressed by creating secondary indexes on the tables. Refer to the
“Design guidelines and considerations” section of this document for additional considerations for
secondary indexes. Use the SQL/MX database runtime statistics (the DISPLAY STATISTICS command) and
DISPLAY_EXPLAIN output to determine whether secondary indexes would benefit query performance.
Consider creating a secondary index when the DISPLAY_EXPLAIN output indicates that a column
predicate with low selectivity (one that qualifies only a small fraction of the rows) results in a table
scan, and the runtime statistics show a large difference between rows accessed and rows used.
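The diagnosis might look like the following mxci session; the query, statement name, and index are assumptions for illustration, not taken from a specific application:

```sql
-- Prepare the suspect query and examine its plan (names hypothetical).
PREPARE q1 FROM SELECT order_id, sale_amount FROM sales_fact WHERE ship_mode = 'AIR';
DISPLAY_EXPLAIN q1;      -- check whether the plan scans the whole table
EXECUTE q1;
DISPLAY STATISTICS;      -- compare rows accessed with rows used

-- If the scan accesses far more rows than it uses, a secondary index on
-- the predicate column may be worthwhile:
CREATE INDEX ix_sales_shipmode ON sales_fact (ship_mode);
```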