The optimizer chooses specific access plans for subsequent steps of the query based on the number of rows expected from earlier steps. When the difference between the predicted and actual row counts is as great as several million, the subsequent query steps often cannot deliver adequate performance. In cases of severe data skew, the optimizer has little chance of producing a reasonable plan. It might choose a plan that is fast and efficient for small quantities of data but very expensive when large results are returned unexpectedly.
The customer corrected this problem relatively easily by explicitly setting the range of values in the catalog to represent more accurately the volume of data that would be selected.
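One common way to achieve a similar effect is to regenerate the histogram for the skewed column with a larger number of intervals, so that the optimizer sees the true value range. The table and column names below are hypothetical, and the options shown are only one reasonable way to refresh the statistics:

  -- Hypothetical table and column; rebuild the histogram with more
  -- intervals so the optimizer sees the true distribution of values.
  UPDATE STATISTICS FOR TABLE sales_fact
    ON order_date GENERATE 100 INTERVALS;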
Skewed data can also occur as data is loaded into the fact tables. Another customer had a table that used a partition-number column as the partitioning key, over which data rows were distributed. The original intent was to round-robin rows over the range of partition numbers during the load processing each month. However, the actual distribution of values was far less balanced, with some partitions holding several times more rows than others. As a result, ESPs accessing a large range within one partition delayed the processing of the entire query. The imbalance also degraded system utilization, because only a few processors and disks worked on the query while the others became idle within a short time.
The original design was based on distributing rows over a subset of data volumes, for example, one for 
each processor. Other customers have used the same approach with fairly good results. The ideal way to 
load the data is to distribute the data to be queried as evenly as possible across all processors, and 
possibly across all partitions.  
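As a sketch of the intended round-robin behavior, the partition-number column can be derived from a monotonically increasing staging value during the load, so that consecutive rows land in consecutive partition numbers. The staging table, column names, and the count of 16 partition numbers below are hypothetical:

  -- Hypothetical staging and fact tables; MOD spreads consecutive
  -- staging rows evenly over partition numbers 0 through 15.
  INSERT INTO sales_fact (part_num, order_id, order_date, amount)
    SELECT MOD(s.row_seq, 16), s.order_id, s.order_date, s.amount
    FROM sales_staging s;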
Opinions differ regarding whether the data should be distributed in striped sets (where each set consists of one disk partition for each processor) or across all partitions of the table. One benefit of using all partitions is that increasing the number of processors does not invalidate the original partitioning strategy. Another advantage is that each partition is smaller than if the data were clustered over a few partitions, and more of the partitions are likely to contain the requested data. If some partitions are skipped because of skewed data, it is less likely that only a few ESPs will contribute to the query.
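One way to spread the table over all partitions is to range-partition it on the partition-number column, placing one partition on each data volume. The volume names, key values, and column layout below are hypothetical, and the PARTITION clause is shown only as an approximation of the SQL/MX syntax:

  -- Hypothetical layout: one range partition per data volume, keyed on
  -- the partition-number column so that every volume receives rows.
  CREATE TABLE sales_fact
    ( part_num    INT           NOT NULL
    , order_id    LARGEINT      NOT NULL
    , order_date  DATE          NOT NULL
    , amount      NUMERIC(12,2)
    , PRIMARY KEY (part_num, order_id) )
    STORE BY PRIMARY KEY
    LOCATION $DATA01
    PARTITION ( ADD FIRST KEY (4)  LOCATION $DATA02
              , ADD FIRST KEY (8)  LOCATION $DATA03
              , ADD FIRST KEY (12) LOCATION $DATA04 );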
Another common problem involving skewed data arises when the dimension data distribution is skewed 
over the predicate column or columns. This problem is most apparent when the dimension is selected as 
the outer table.  
When the dimension is the outer table, each ESP reads its partition to select qualifying rows. If most rows are clustered within a single partition, one ESP has the majority of rows to process, and the plan effectively degrades into a serial plan.
The solution is either to index the dimension table (so that the optimizer selects a different plan) or to redistribute the dimension rows to eliminate the clustering effect.
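A minimal sketch of the indexing option follows; the dimension table, column, and index names are hypothetical:

  -- Hypothetical dimension table; an index on the predicate column gives
  -- the optimizer an access path that is independent of the skewed
  -- clustering order.
  CREATE INDEX ix_customer_region ON customer_dim (region_code);
  UPDATE STATISTICS FOR TABLE customer_dim ON region_code;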
Addressing Problems in Large Tables 
Large tables present unique challenges for performance optimization. These problems arise in several 
ways:  
•  Access through non-key columns 
•  Inability to use access through a subset of the primary key with MDAM query technology
•  Frequent access to detail data that is aggregated 
•  Joins between large tables 
The first two problems can be addressed by creating secondary indexes on the tables. Refer to the 
“Design guidelines and considerations” section of this document for additional considerations for 
secondary indexes. Use the SQL/MX database runtime statistics (the DISPLAY STATISTICS command) and 
DISPLAY_EXPLAIN output to determine whether secondary indexes would benefit query performance. 
Consider creating a secondary index when the DISPLAY_EXPLAIN output indicates that a column predicate with low selectivity (one that qualifies only a small fraction of the rows) results in a table scan and the runtime statistics show a large difference between rows accessed and rows used.
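As an illustration, a hypothetical sequence for evaluating and then adding a secondary index from mxci might look like the following; the table, column, and index names are not taken from the scenarios above:

  -- Show the plan first; a table scan driven by a selective predicate
  -- suggests that a secondary index would help.
  DISPLAY_EXPLAIN
    SELECT order_id, amount
    FROM   sales_fact
    WHERE  ship_date = DATE '2003-06-30';

  -- If the plan shows a scan and DISPLAY STATISTICS reports far more rows
  -- accessed than used, create the index and refresh its statistics.
  CREATE INDEX ix_fact_shipdate ON sales_fact (ship_date);
  UPDATE STATISTICS FOR TABLE sales_fact ON ship_date;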










