SQL/MX Release 2.0 Best Practices

Possible Implementations
Join Efficiency
Primary-key order can have a large impact on join performance. A typical example involves a frequently
used dimension table that is joined to a fact table.
Assume that a predicate is supplied on a column of the dimension table. If the join column in the fact
table is the first key column, the SQL/MX optimizer may read the dimension table, select the rows that
match the predicate, and then join directly to the fact table. However, if the join column is low in the
fact table's key order and MDAM query technology cannot be used, the optimizer may scan the entire
fact table and perform a hash join to the dimension table, which can be a resource-intensive operation.
When you can identify dimension tables that are likely to participate in all or most queries, consider
promoting the join column to a higher position in the fact table's primary-key order. It might be
sufficient to place this column immediately after a low-cardinality column, where MDAM query
technology can be used.
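As an illustrative sketch (the table and column names are hypothetical, and the DDL is simplified generic SQL rather than exact SQL/MX syntax), a fact table keyed this way might look like:

```sql
-- Hypothetical fact table: REGION_CODE is a low-cardinality leading key
-- column, so MDAM can skip through its few values while ITEM_NUMBER
-- (the join column to the ITEM dimension) sits second in the key.
CREATE TABLE SALES_FACT
( REGION_CODE   SMALLINT      NOT NULL   -- low cardinality (e.g., < 20 values)
, ITEM_NUMBER   INTEGER       NOT NULL   -- join column to ITEM dimension
, SALES_DATE    DATE          NOT NULL
, QUANTITY      INTEGER
, AMOUNT        NUMERIC(12,2)
, PRIMARY KEY (REGION_CODE, ITEM_NUMBER, SALES_DATE)
);

-- A query with a predicate on the dimension can now reach the fact table
-- through the key: the optimizer reads ITEM, selects the matching
-- ITEM_NUMBER values, and probes SALES_FACT via MDAM across the few
-- REGION_CODE values, instead of scanning the whole fact table.
SELECT F.SALES_DATE, SUM(F.AMOUNT)
FROM   ITEM I, SALES_FACT F
WHERE  I.ITEM_CATEGORY = 'APPAREL'
AND    F.ITEM_NUMBER = I.ITEM_NUMBER
GROUP BY F.SALES_DATE;
```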
When multiple tables (especially large tables) share similar key columns, consider using the same ordering
and partitioning schemes for all tables. Joins between tables that have the same key and partitioning
structures are very efficient because the query might benefit from partition-by-partition key-sequence
merge joins rather than less-efficient repartition joins.
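To sketch this arrangement (hypothetical names; simplified DDL, and the partition boundaries are shown as comments rather than exact SQL/MX partition syntax), two large tables sharing the same leading key columns and partitioning boundaries might be declared as:

```sql
-- Both tables lead with ACCOUNT_ID and are partitioned on the same
-- boundaries, so an equijoin on the shared key columns can proceed
-- partition by partition as a key-sequenced merge join, avoiding a
-- repartitioning step.
CREATE TABLE ORDERS
( ACCOUNT_ID  INTEGER NOT NULL
, ORDER_ID    INTEGER NOT NULL
, ORDER_DATE  DATE
, PRIMARY KEY (ACCOUNT_ID, ORDER_ID)
);  -- partitioned on ACCOUNT_ID at 100000, 200000, 300000

CREATE TABLE ORDER_LINES
( ACCOUNT_ID  INTEGER  NOT NULL
, ORDER_ID    INTEGER  NOT NULL
, LINE_NO     SMALLINT NOT NULL
, PRIMARY KEY (ACCOUNT_ID, ORDER_ID, LINE_NO)
);  -- partitioned on ACCOUNT_ID at the same boundaries

-- This join can execute as parallel per-partition merge joins:
SELECT O.ORDER_DATE, L.LINE_NO
FROM   ORDERS O, ORDER_LINES L
WHERE  O.ACCOUNT_ID = L.ACCOUNT_ID
AND    O.ORDER_ID   = L.ORDER_ID;
```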
Insert Efficiency
Although insert efficiency is an important performance consideration, it should usually be secondary to
other considerations unless specific circumstances indicate otherwise.
The choice of primary key determines the degree of parallelism available to a data-load application. For
example, consider what occurs when SALES_DATE is the high-order column of a fact table’s primary key.
If data is always loaded in SALES_DATE order, the parallelism of the load application is limited to the
number of volumes over which an individual date can be partitioned. Conversely, if the high-order
column is ITEM_NUMBER, parallelism is inherently greater because the data is spread over many
more volumes. One way to balance this situation is to use hash partitioning.
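As a sketch of the tradeoff (hypothetical names; the hash-partitioning clause is illustrative rather than exact SQL/MX DDL), hash partitioning spreads each day's rows across all volumes even when SALES_DATE leads the key:

```sql
-- With SALES_DATE leading the key and date-range partitioning, each
-- day's load touches only the volumes holding that date. Hashing the
-- rows across volumes instead lets a daily load run with one parallel
-- stream per volume.
CREATE TABLE SALES_FACT
( SALES_DATE   DATE    NOT NULL
, ITEM_NUMBER  INTEGER NOT NULL
, QUANTITY     INTEGER
, PRIMARY KEY (SALES_DATE, ITEM_NUMBER)
)
HASH PARTITION BY (ITEM_NUMBER)   -- illustrative clause, not exact syntax
( ADD LOCATION $DATA01
, ADD LOCATION $DATA02
, ADD LOCATION $DATA03
, ADD LOCATION $DATA04
);
```

Because the hash of ITEM_NUMBER determines placement, a single date's rows land on all four volumes, and the load for that date can be parallelized across them.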
Usually load processing is a batch operation that occurs regularly (each day, week, or month) and is
bounded by the time window available for the load. Query processing occurs at all other times.
Generally you should optimize query processing, which occurs frequently, rather than optimizing less-
frequent batch load processing. Your circumstances may differ, however, so you must determine the
tradeoffs that are appropriate for your application.
For example, for some applications, the batch load processing must complete within a certain time. In
this case, you should optimize the batch load processing design for best performance, even if that means
some tradeoff in query performance. See the “Fact table partitioning techniques” and “Dimension table
partitioning techniques” sections of this document for more details.
Data Clustering
In addition to providing a fast and efficient access path, the primary key should physically cluster the
data. This arrangement benefits sequential access in primary-key order, which is common in DSS query
processing, because each I/O operation reads a large block of data from disk into cache memory,
enhancing query performance.
When a primary key involves data that has a skewed distribution, system performance can degrade. To
ensure a satisfactory distribution of data and optimal query performance, you must understand the
distribution of values in all the columns that make up the key. Specifying more precise partitioning
boundaries might be sufficient to deal with skewed values. However, if the key contains severely skewed
values, hash partitioning may be an easier way to achieve a more uniform distribution.
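To sketch the first option (hypothetical data; the partition boundaries are shown as comments, not exact SQL/MX DDL), suppose one REGION value holds about half of all rows; narrower boundaries that split on the second key column can rebalance the partitions:

```sql
-- REGION 'WEST' contains roughly half of all rows. Instead of one
-- partition per region, split 'WEST' on the second key column so each
-- partition holds a comparable share of the data.
CREATE TABLE CUSTOMER
( REGION       CHAR(8) NOT NULL
, CUSTOMER_ID  INTEGER NOT NULL
, PRIMARY KEY (REGION, CUSTOMER_ID)
);
-- Illustrative partition boundaries:
--   partition 1: rows through ('EAST', highest CUSTOMER_ID)
--   partition 2: rows through ('WEST', 500000)  -- first half of 'WEST'
--   partition 3: remaining 'WEST' rows
-- If the skew is too severe to balance by hand, hash partitioning on
-- CUSTOMER_ID distributes rows uniformly without boundary tuning.
```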