SQL/MX Release 2.0 Best Practices

Possible Implementations
Join Efficiency
Primary-key order can have a large impact on join performance. A typical example involves a frequently
used dimension table that is joined to a fact table.
Assume that a predicate is supplied on a column of the dimension table. If the join column in the fact
table is the first key column, the SQL/MX optimizer may read the dimension table, select the rows that
match the predicate, and then join directly to the fact table. However, if the join column is low in the
fact table's key order and MDAM query technology cannot be used, the optimizer may scan the entire
fact table and perform a hash join to the dimension table, which can be a resource-intensive operation.
When you can identify dimension tables that are likely to participate in all or most queries, consider
promoting the join column to a higher position in the fact table's primary-key order. It might be
sufficient to place this column immediately after a low-cardinality column, where MDAM query
technology can be used.
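As an illustrative sketch (the table and column names are hypothetical, and the DDL is simplified generic SQL rather than exact SQL/MX syntax), a fact table keyed this way might look like:

```sql
-- Hypothetical fact table: REGION_CODE is a low-cardinality leading key
-- column, so MDAM can skip through its few values while ITEM_NUMBER
-- (the join column to the ITEM dimension) sits second in the key.
CREATE TABLE SALES_FACT
( REGION_CODE   SMALLINT      NOT NULL   -- low cardinality (e.g., < 20 values)
, ITEM_NUMBER   INTEGER       NOT NULL   -- join column to ITEM dimension
, SALES_DATE    DATE          NOT NULL
, QUANTITY      INTEGER
, AMOUNT        NUMERIC(12,2)
, PRIMARY KEY (REGION_CODE, ITEM_NUMBER, SALES_DATE)
);

-- A query with a predicate on the dimension can now reach the fact table
-- through the key: the optimizer reads ITEM, selects the matching
-- ITEM_NUMBER values, and probes SALES_FACT via MDAM across the few
-- REGION_CODE values, instead of scanning the whole fact table.
SELECT F.SALES_DATE, SUM(F.AMOUNT)
FROM   ITEM I, SALES_FACT F
WHERE  I.ITEM_CATEGORY = 'APPAREL'
AND    F.ITEM_NUMBER = I.ITEM_NUMBER
GROUP BY F.SALES_DATE;
```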
When multiple tables (especially large tables) share similar key columns, consider using the same ordering
and partitioning schemes for all tables. Joins between tables that have the same key and partitioning
structures are very efficient because the query might benefit from partition-by-partition key-sequence
merge joins rather than less-efficient repartition joins.
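To sketch this arrangement (hypothetical names; simplified DDL, and the partition boundaries are shown as comments rather than exact SQL/MX partition syntax), two large tables sharing the same leading key columns and partitioning boundaries might be declared as:

```sql
-- Both tables lead with ACCOUNT_ID and are partitioned on the same
-- boundaries, so an equijoin on the shared key columns can proceed
-- partition by partition as a key-sequenced merge join, avoiding a
-- repartitioning step.
CREATE TABLE ORDERS
( ACCOUNT_ID  INTEGER NOT NULL
, ORDER_ID    INTEGER NOT NULL
, ORDER_DATE  DATE
, PRIMARY KEY (ACCOUNT_ID, ORDER_ID)
);  -- partitioned on ACCOUNT_ID at 100000, 200000, 300000

CREATE TABLE ORDER_LINES
( ACCOUNT_ID  INTEGER  NOT NULL
, ORDER_ID    INTEGER  NOT NULL
, LINE_NO     SMALLINT NOT NULL
, PRIMARY KEY (ACCOUNT_ID, ORDER_ID, LINE_NO)
);  -- partitioned on ACCOUNT_ID at the same boundaries

-- This join can execute as parallel per-partition merge joins:
SELECT O.ORDER_DATE, L.LINE_NO
FROM   ORDERS O, ORDER_LINES L
WHERE  O.ACCOUNT_ID = L.ACCOUNT_ID
AND    O.ORDER_ID   = L.ORDER_ID;
```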
Insert Efficiency
Although insert efficiency is an important performance consideration, it should usually be secondary to
other considerations unless specific circumstances indicate otherwise.
The choice of primary key determines the degree of parallelism available to a data-load application. For
example, consider what occurs when SALES_DATE is the high-order column of a fact table’s primary key.
If data is always loaded in SALES_DATE order, the parallelism of the load application is limited to the
number of volumes over which an individual date can be partitioned. Conversely, if the high-order
column is ITEM_NUMBER, parallelism is inherently greater because the data is spread over many
more volumes. One way to balance this situation is to use hash partitioning.
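As a sketch of the tradeoff (hypothetical names; the hash-partitioning clause is illustrative rather than exact SQL/MX DDL), hash partitioning spreads each day's rows across all volumes even when SALES_DATE leads the key:

```sql
-- With SALES_DATE leading the key and date-range partitioning, each
-- day's load touches only the volumes holding that date. Hashing the
-- rows across volumes instead lets a daily load run with one parallel
-- stream per volume.
CREATE TABLE SALES_FACT
( SALES_DATE   DATE    NOT NULL
, ITEM_NUMBER  INTEGER NOT NULL
, QUANTITY     INTEGER
, PRIMARY KEY (SALES_DATE, ITEM_NUMBER)
)
HASH PARTITION BY (ITEM_NUMBER)   -- illustrative clause, not exact syntax
( ADD LOCATION $DATA01
, ADD LOCATION $DATA02
, ADD LOCATION $DATA03
, ADD LOCATION $DATA04
);
```

Because the hash of ITEM_NUMBER determines placement, a single date's rows land on all four volumes, and the load for that date can be parallelized across them.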
Usually load processing is a batch operation that occurs regularly (each day, week, or month) and is
bounded by the time window available for the load. Query processing occurs at all other times.
Generally you should optimize query processing, which occurs frequently, rather than optimizing less-
frequent batch load processing. Your circumstances may differ, however, so you must determine the
tradeoffs that are appropriate for your application.
For example, for some applications, the batch load processing must complete within a certain time. In
this case, you should optimize the batch load processing design for best performance, even if that means
some tradeoff in query performance. See the “Fact table partitioning techniques” and “Dimension table
partitioning techniques” sections of this document for more details.
Data Clustering
In addition to providing a fast and efficient access path, the primary key should physically cluster the
data. This arrangement benefits sequential access in primary-key order, which is common in DSS query
processing, because each I/O operation reads a large block of data from disk into cache memory,
enhancing query performance.
When a primary key involves data that has a skewed distribution, system performance can degrade. To
ensure a satisfactory distribution of data and optimal query performance, you must understand the
distribution of values in all the columns that make up the key. Specifying more precise partitioning
boundaries might be sufficient to deal with skewed values. However, if the key contains severely skewed
values, hash partitioning may be an easier way to achieve a more uniform distribution.
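To sketch the first option (hypothetical data; the partition boundaries are shown as comments, not exact SQL/MX DDL), suppose one REGION value holds about half of all rows; narrower boundaries that split on the second key column can rebalance the partitions:

```sql
-- REGION 'WEST' contains roughly half of all rows. Instead of one
-- partition per region, split 'WEST' on the second key column so each
-- partition holds a comparable share of the data.
CREATE TABLE CUSTOMER
( REGION       CHAR(8) NOT NULL
, CUSTOMER_ID  INTEGER NOT NULL
, PRIMARY KEY (REGION, CUSTOMER_ID)
);
-- Illustrative partition boundaries:
--   partition 1: rows through ('EAST', highest CUSTOMER_ID)
--   partition 2: rows through ('WEST', 500000)  -- first half of 'WEST'
--   partition 3: remaining 'WEST' rows
-- If the skew is too severe to balance by hand, hash partitioning on
-- CUSTOMER_ID distributes rows uniformly without boundary tuning.
```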