The optimizer chooses specific access plans for subsequent steps of the query based on the number of rows expected from earlier steps. When the difference between the predicted and actual row counts is as great as several million, the subsequent query steps often cannot deliver adequate performance. In cases of severe data skew, the optimizer has little chance of producing a reasonable plan. It might choose a plan that is fast and efficient for small quantities of data but very expensive when large results are returned unexpectedly.
The customer corrected this problem relatively easily by explicitly setting the range of values in the catalog to represent more accurately the volume of data that would be selected.
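One common way to achieve a similar effect is to regenerate the histogram for the skewed column with a larger number of intervals, so that the optimizer sees the true value range. The table and column names below are hypothetical, and the options shown are only one reasonable way to refresh the statistics:

  -- Hypothetical table and column; rebuild the histogram with more
  -- intervals so the optimizer sees the true distribution of values.
  UPDATE STATISTICS FOR TABLE sales_fact
    ON order_date GENERATE 100 INTERVALS;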
Skewed data can also occur as data is loaded into the fact tables. Another customer had a table that used a partition-number column as the partitioning key, over which data rows were distributed. The original intent was to round-robin rows over the range of partition numbers during the load processing each month. However, the actual distribution of values was far less balanced, with some partitions holding several times more rows than others. As a result, ESPs accessing a large range within one partition delayed the processing of the entire query. The imbalance also degraded system utilization, because only a few processors and disks worked on the query while the others became idle within a short time.
The original design was based on distributing rows over a subset of data volumes, for example, one for 
each processor. Other customers have used the same approach with fairly good results. The ideal way to 
load the data is to distribute the data to be queried as evenly as possible across all processors, and 
possibly across all partitions.  
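As a sketch of the intended round-robin behavior, the partition-number column can be derived from a monotonically increasing staging value during the load, so that consecutive rows land in consecutive partition numbers. The staging table, column names, and the count of 16 partition numbers below are hypothetical:

  -- Hypothetical staging and fact tables; MOD spreads consecutive
  -- staging rows evenly over partition numbers 0 through 15.
  INSERT INTO sales_fact (part_num, order_id, order_date, amount)
    SELECT MOD(s.row_seq, 16), s.order_id, s.order_date, s.amount
    FROM sales_staging s;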
Opinions differ regarding whether the data should be distributed in striped sets (where each set consists of one disk partition for each processor) or across all partitions of the table. One benefit of using all partitions is that increasing the number of processors does not invalidate the original partitioning strategy. Another advantage is that each partition is smaller than if the data were clustered over a few partitions, and more of the partitions are likely to contain the requested data. If some partitions are skipped because of skewed data, it is less likely that only a few ESPs will contribute to the query.
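One way to spread the table over all partitions is to range-partition it on the partition-number column, placing one partition on each data volume. The volume names, key values, and column layout below are hypothetical, and the PARTITION clause is shown only as an approximation of the SQL/MX syntax:

  -- Hypothetical layout: one range partition per data volume, keyed on
  -- the partition-number column so that every volume receives rows.
  CREATE TABLE sales_fact
    ( part_num    INT           NOT NULL
    , order_id    LARGEINT      NOT NULL
    , order_date  DATE          NOT NULL
    , amount      NUMERIC(12,2)
    , PRIMARY KEY (part_num, order_id) )
    STORE BY PRIMARY KEY
    LOCATION $DATA01
    PARTITION ( ADD FIRST KEY (4)  LOCATION $DATA02
              , ADD FIRST KEY (8)  LOCATION $DATA03
              , ADD FIRST KEY (12) LOCATION $DATA04 );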
Another common problem involving skewed data arises when the dimension data distribution is skewed 
over the predicate column or columns. This problem is most apparent when the dimension is selected as 
the outer table.  
When the dimension is the outer table, each ESP reads its partition to select qualifying rows. If most rows are clustered within a single partition, one ESP has the majority of rows to process, and the plan effectively degrades into a serial plan.
The solution is either to index the dimension table (so that the optimizer selects a different plan) or to redistribute the dimension rows to eliminate the clustering effect.
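A minimal sketch of the indexing option follows; the dimension table, column, and index names are hypothetical:

  -- Hypothetical dimension table; an index on the predicate column gives
  -- the optimizer an access path that is independent of the skewed
  -- clustering order.
  CREATE INDEX ix_customer_region ON customer_dim (region_code);
  UPDATE STATISTICS FOR TABLE customer_dim ON region_code;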
Addressing Problems in Large Tables 
Large tables present unique challenges for performance optimization. These problems arise in several 
ways:  
•  Access through non-key columns 
•  Inability to use access through a subset of the primary key with MDAM query technology
•  Frequent access to detail data that is aggregated 
•  Joins between large tables 
The first two problems can be addressed by creating secondary indexes on the tables. Refer to the 
“Design guidelines and considerations” section of this document for additional considerations for 
secondary indexes. Use the SQL/MX database runtime statistics (the DISPLAY STATISTICS command) and 
DISPLAY_EXPLAIN output to determine whether secondary indexes would benefit query performance. 
Consider creating a secondary index when the DISPLAY_EXPLAIN output indicates that a column predicate with low selectivity (one that qualifies only a small fraction of the rows) results in a table scan and the runtime statistics show a large difference between rows accessed and rows used.
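As an illustration, a hypothetical sequence for evaluating and then adding a secondary index from mxci might look like the following; the table, column, and index names are not taken from the scenarios above:

  -- Show the plan first; a table scan driven by a selective predicate
  -- suggests that a secondary index would help.
  DISPLAY_EXPLAIN
    SELECT order_id, amount
    FROM   sales_fact
    WHERE  ship_date = DATE '2003-06-30';

  -- If the plan shows a scan and DISPLAY STATISTICS reports far more rows
  -- accessed than used, create the index and refresh its statistics.
  CREATE INDEX ix_fact_shipdate ON sales_fact (ship_date);
  UPDATE STATISTICS FOR TABLE sales_fact ON ship_date;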










