SQL/MX Best Practices white paper Part number: 540372-003 Second edition: 01/2007
Legal notices © Copyright 2006 Hewlett-Packard Development Company, L.P. The information herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
Contents

Objectives of This Document
Project Planning
    HP Roles and Responsibilities
    Roles
    Technical Lead
    …
Data Locality Objectives
Additional Information
    Renaming a Table
    Limits
    …
Contributors to this Document
Objectives of This Document
This document describes the leading best practices for designing, implementing, and supporting databases that use NonStop SQL/MX. These best practices were gained from time-tested customer experience. Many of these techniques apply to any type of large database environment, including decision-support systems (DSS), data warehouses, operational data stores (ODS), and online transaction processing (OLTP) applications, as well as to smaller data warehouses.
Project Planning
HP Roles and Responsibilities
This list of suggested roles and responsibilities is intended to move a typical database project toward a successful conclusion. One person can assume more than one role; however, where roles conflict or the project timeline demands it, assign the roles to separate individuals.
• Reviews queries
• Explains plans
• Is responsible for the physical design of the database based on accepted database design principles
• Facilitates a transfer of information about the physical design of the database to customer personnel
• Thoroughly understands SQL/MX software technology and its application to a database that must support the required database activities
• Understands the implications of the design on load, maintenance, and archive activities
• Works closely with the designer of the logical database
Project Activities
Guideline for Database Activities
This section lists activities that have been successfully employed at HP accounts. The list is intended as a guide to acquiring the information and data necessary to implement a database solution successfully. Assumptions include the use of a phased approach to limit the scope of the project and to implement additional functionality in subsequent phases.
Query Analysis
• Query types. Determine the query types and rank them by importance. Group them by complexity. Determine the frequency of execution. Limit your analysis to approximately 10 to 20 queries; more would be too complex, and fewer would not represent the business requirements.
• Query structure. Are the queries produced by individuals or intelligent clients? Are there multistep or multipass queries?
• Response time requirements. Determine the response-time requirements for each type of query.
• Primary key analysis of fact tables. Are multicolumn keys needed or desired? Determine key distributions by partition. Consider the effect on MDAM query technology processing.
• Primary key analysis of dimension tables. For a star schema, typically choose single-column keys. To use multicolumn keys, you must revisit the data model. Is the snowflake design technique a better choice?
• Map partitions to configuration. Determine which partitions of which tables to collocate.
• Trickle compared to batch updates. Compare trickle against batch updates. What is the impact on rerunning queries and on query performance? Consider the size of the load and update window and the impact on aggregates.
• Custom versus DataLoader/MX versus tool. Consider the need for parallelism and for a robust architecture. Compare flexibility against a “canned” approach, partition management needs, and so on.
• Coded data insert and update processes.
Design Guidelines and Considerations
Physical System Configuration Guidelines
Distribute the disk subsystem across all processors to balance I/O processing activity more evenly and thereby minimize processor and disk bottlenecks. Construct a symmetric hardware configuration where all hardware (especially disk devices) is distributed over all processors as evenly as possible. Some customers might need to obtain additional I/O enclosures to achieve a balance.
default, the clustering key is the primary key, but a separate clustering key can be specified with the STORE BY clause. When a primary key is specified for a table, the clustering key can comprise all or some of the columns of the primary key. If only some of the primary key columns are used for clustering, you must use the leading columns of the primary key in the STORE BY clause. When all or some of the primary key columns are used in the STORE BY clause, the data is clustered by the primary key.
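As an illustration, the following sketch clusters a table on the leading column of its primary key (the table and column names are hypothetical):

CREATE TABLE sales.orders
( order_date DATE NOT NULL
, order_id INTEGER NOT NULL
, status CHAR(2)
, PRIMARY KEY (order_date, order_id)
)
STORE BY (order_date);  -- leading column of the primary key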
• Join efficiency
• Insert efficiency
• Data clustering
You must determine whether to create the index before or after the data is loaded into the table. If you create the index first, the index acts as a constraint, preventing duplicate data from being inserted. If you create the index after the data is loaded into the table, you may need to eliminate any duplicates before creating the unique index.
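For example, one way to add the uniqueness check after the load completes is through a UNIQUE constraint, which SQL/MX enforces with a system-created index. The table and column names here are illustrative:

ALTER TABLE sales.orders
ADD CONSTRAINT orders_ref_uq UNIQUE (order_ref);  -- fails if duplicates remain, so remove them first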
If you know that a particular high-cardinality column will always be included in a query predicate, use that column as part of the primary key because that column will immediately restrict the query search space. Be aware of the following situation associated with low-cardinality columns. Consider a case where the primary key is composed of a very low cardinality column (for example, 1 or 2 unique values) that is always specified in the query, followed by a column that has very high cardinality that is rarely included in a query predicate. In this case, the leading column eliminates little of the search space, and the selective column is seldom available to narrow it further, so queries must scan a large portion of the table.
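To make the tradeoff concrete, here is a sketch of such a key (the table and columns are hypothetical):

-- Primary key (active_flag, txn_id): active_flag has only two unique values
-- and is always supplied; txn_id is highly selective but rarely supplied.
SELECT COUNT(*)
FROM sales.txns
WHERE active_flag = 1;  -- the predicate eliminates at most half the table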
In the absence of query requirements, designers often optimize for the requirements they understand, usually database loading. Although such a design is not necessarily bad, it often produces good performance for database loading, but poor performance for query processing. Whenever possible, try to anticipate the query requirements in your design decisions. At the very least, set the expectation that portions of the database might need to be modified (and programs changed) once query requirements are known.
Database Sizing Considerations
General Rules for Sizing
This section includes some general rules for sizing a database system when very few details are known about the application.
• A processor supports about 30 to 50 gigabytes of data (logical data, based on common customer implementations); the number varies with the processors and types of storage chosen. Consult your hardware representative about the capabilities of your processor and storage configuration.
• The total table space is about 1.
Network Data
For each legacy system involved:
• What is the network connection?
• Are multiple links to the legacy systems needed or desired?
• What is the preferred transfer mechanism—for example, file transfer protocol (FTP) or network data management protocol (NDMP)?
• Does the transfer mechanism support the anticipated volumes and cycles of updates? (This consideration is not really a sizing parameter, but rather a “sanity check” regarding the preferred mechanism.)
In NonStop SQL/MX, a scratch file can overflow to another disk. So, if a scratch file becomes full or if the disk becomes full, the operation does not necessarily fail. An additional scratch file on another disk is selected (using the criterion procedure). As a result, there is no 2 GB limit on scratch space. In NonStop SQL/MX, the operations that can create scratch files are sort, hash join, and hash groupby. They all use the criterion procedure to determine which scratch disk to use.
Carefully determine the space requirements of summary tables. Often, summary tables are larger than expected due to the density of the dimensional values when aggregated compared with the sparsity of the dimensional values in the fact table. If fact table updates are permitted, you must determine whether the summary tables should be updated or rebuilt. Correctly updating the summary tables is challenging, while rebuilding the summary tables requires many processing cycles.
When designing a database distributed across multiple systems, you have several alternatives for locating tables. You must consider the location of the fact, dimension, and summary tables, as well as secondary indexes. You might choose to keep each table entirely on one system, but distribute the collection of tables over all systems, which is feasible if the division of tables coincides with the workload division.
No restrictions exist on the number of partitions that an index supports; however, beyond 512 partitions, performance and memory issues can occur.
• INSERT operations. Up to 150 records can be inserted into SQL/MX tables in a single INSERT operation; up to 250 records can be inserted into SQL/MP tables in a single INSERT operation (see the example after this list).
• Joins. Up to 40 tables can be joined, including base tables of views, but joining a large number of tables affects performance.
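As an illustration of the multirow INSERT form to which the 150-row SQL/MX limit applies (the table and values are hypothetical):

INSERT INTO sales.region (region_id, region_name)
VALUES (1, 'EAST')
     , (2, 'WEST')
     , (3, 'NORTH');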
This practice enables RDF to properly map files between primary and backup systems. If you allow the system to generate the names, there will be incompatible names between your primary and secondary systems.
Using SQLMXBUFFER to Improve Database Performance
SQL/MX Session Data Space and Data Cache Guidelines
When a SQL statement is compiled, a query plan is created.
The SQLMXBUFFER display includes these counters:
• 4KB Blocks (MAX): The number of 4 KB blocks that can be used within the SQL/MX buffer space for query fragments.
• Reuse (Attempts): The number of times an attempt was made to reuse information in the SQL/MX buffer.
• Reuse (OK): The number of times an attempt to reuse information in the SQL/MX buffer was successful.
• Reuse (Failed FST and Failed ID): The number of times an attempt to reuse information in the SQL/MX buffer was unsuccessful.
• Parallelism among a large number of ESPs
• Running plans that use sort and grouping operators
• The optimization of complex plans
The heavy consumption of addressable memory space by SQL/MX processes can lead to insufficient swap-file space. Therefore, you should provide more kernel-managed swap space on each CPU by increasing the size of existing swap files or by adding new swap files.
If the LOCATION clause is omitted from the CREATE TABLE statement, table creation defaults to the volume from which NonStop SQL/MX was started, which could be an unaudited volume. Some tools, like MicroStrategy, allow you to set a preference to append information to an SQL statement. You could use this capability to add the LOCATION clause. Not all tools allow this use. However, you can use this workaround: add a default location through MXCS: Name: _DEFAULTS Value: class defaults, volume $.
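A minimal sketch of naming the location explicitly (the volume and table names are hypothetical; verify that the volume is audited):

CREATE TABLE sales.orders
( order_id INTEGER NOT NULL PRIMARY KEY )
LOCATION $DATA01;  -- explicit audited volume instead of the startup default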
MDAM
MDAM Query Technology Design Considerations
To exploit the capabilities of MDAM query technology, first you must know the data distributions and unique values for each key column in the table. Generally, you should order key columns to provide the most efficient and direct access into the table, based on the columns most frequently used by the queries. If you have a choice in determining the order of the key columns, order them by their unique value count, from least to greatest.
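For example, assuming unique value counts of roughly 10 regions, 500 stores, and millions of transactions (all names and counts here are illustrative), the key columns would be ordered like this:

CREATE TABLE sales.txn_detail
( region_cd SMALLINT NOT NULL   -- ~10 unique values
, store_id INTEGER NOT NULL     -- ~500 unique values
, txn_id LARGEINT NOT NULL      -- millions of unique values
, amount NUMERIC(12,2)
, PRIMARY KEY (region_cd, store_id, txn_id)  -- ordered least to greatest
);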
Possible Implementations
Join Efficiency
Primary-key order can have a large impact on join performance. A typical example involves a frequently used dimension table that is joined to a fact table. Assume that a predicate is supplied on a column of the dimension table. If the join column in the fact table is the first key column, the SQL/MX optimizer may read the dimension table, select the rows that match the predicate, and then join directly to the fact table.
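A sketch of the pattern just described (the table and column names are assumed):

SELECT f.txn_id, f.amount
FROM dim_store d, fact_txn f
WHERE d.store_name = 'DOWNTOWN'  -- predicate on the dimension table
  AND f.store_id = d.store_id;   -- store_id is the first key column of fact_txn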
Partitioning of tables and indexes is required if you want to take advantage of parallelism. HP recommends that you limit the number of partitions used, because processes perform better when they access fewer drives. This recommendation differs from the traditional SQL/MP method of database design, in which distributing partitions across many drives normally increases performance.
Fact Table Partitioning Techniques
Fact tables are often very large and, therefore, must be partitioned across multiple disk volumes.
Sequential Range Partitioning
Sequential partitioning means that the partitioning and clustering key values of all subsequent data added to a fact table are always greater than the partitioning and clustering key values in the existing data. That is, the partition range is continually increasing. For example, sequential partitioning occurs in a table that uses a date for the partitioning column.
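A sketch of sequential range partitioning on a date column. The names, key values, and volumes are hypothetical, and the partition clause is paraphrased; verify the exact syntax against the SQL/MX Reference Manual:

CREATE TABLE sales.fact_txn
( txn_date DATE NOT NULL
, store_id INTEGER NOT NULL
, amount NUMERIC(12,2)
, PRIMARY KEY (txn_date, store_id)
)
LOCATION $DATA01
PARTITION ( ADD FIRST KEY (DATE '2007-04-01') LOCATION $DATA02
          , ADD FIRST KEY (DATE '2007-07-01') LOCATION $DATA03 );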
Hash partitioning uses a hashing key to randomly distribute the data. New rows are inserted within existing cluster key order within the volume, causing table fragmentation as existing blocks are split to make room for newly added rows. Online reloads are required periodically to defragment all partitions and maintain adequate scan performance. Note that all partitions become fragmented as new data is inserted into the table across the entire partitioning range.
time, starting with the same partition-number value. This occurs only when the fact table is not the first outer table of the query.
Other Benefits of Partitioning
The partition table has other benefits. If the partition table is selected as the outer table of the join, the partition table determines the number of ESPs the query uses.
3. Determine the effect of the UPDATE STATISTICS statement, and, optionally, back out the generated histograms, if necessary.
a. In MXCI, back up the current histogram tables, if any:
• CREATE TABLE myhist LIKE HISTOGRAMS;
• INSERT INTO myhist SELECT * FROM HISTOGRAMS;
• CREATE TABLE myhistint LIKE HISTOGRAM_INTERVALS;
• INSERT INTO myhistint SELECT * FROM HISTOGRAM_INTERVALS;
b. Issue the UPDATE STATISTICS command for the required column groups.
c. Recompile the query.
d. If the new plan performs worse, restore the saved histograms from the backup tables, as shown below.
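A minimal sketch of the restore in step d. This restores every histogram saved in step a; in practice you might restrict the DELETE and INSERT by table UID:

DELETE FROM HISTOGRAMS;
INSERT INTO HISTOGRAMS SELECT * FROM myhist;
DELETE FROM HISTOGRAM_INTERVALS;
INSERT INTO HISTOGRAM_INTERVALS SELECT * FROM myhistint;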
To see if cache is operating efficiently, use the STAT option of the PUP LISTCACHE command. If CACHE READ HITS are less than 90 percent, consider increasing the cache size. If the ratio of CACHE FAULTS to CACHE CALLS is greater than one percent, consider reducing the cache size, adding more physical memory to the CPU, or moving processing to other CPUs. Cache size is controlled through the Subsystem Control Facility (SCF). For more information about using SCF to set cache size, see the SCF documentation.
• Avoid the excessive use of CASE statements, data type casting, and function calls within a SQL statement if these operations can be performed within the logic of a program. Compiled application logic can often perform these operations more efficiently.
Managing and Monitoring
Catalog and Schema Placement and Maintenance
Maintain separate catalogs and schemas for test and production purposes. Drop catalog and schema entries when they are no longer required.
Data-loading Techniques and Considerations
This section suggests procedures to load data from a flat file to SQL/MX tables. Your procedures may vary depending on your circumstances.
Create the Table for the Final Destination of the SQL/MX Table
SQL/MX tables cannot be renamed; if you want to partition SQL/MX tables, partition them initially. Do not partition smaller tables that may be best served by a single drive’s cache. Do not create the table along with its indexes.
for data load are IMPORT, the DataLoader/MX programming API, custom programming, or third-party ETL tools.
Getting Guardian Names from the System Catalog for FUP RELOAD
Note that definition_schema_version_1200 is correct for now; however, in the future, there will be version numbers higher than 1200.

set schema definition_schema_version_1200;

select 'fup reload '
     , substring(p.system_name, 1, 15)
     , '.'
     , rtrim(p.data_source)
     , '.'
     , rtrim(p.file_suffix)
     , ', rate 40'
from partitions p, objects o
where o.object_uid = p.object_uid   -- join to OBJECTS reconstructed; the original text was truncated here
  and o.object_name = 'MYTABLE';    -- substitute your table name
When slack space is specified, a portion of each block is reserved for future inserts. Because most fact tables are loaded once and seldom updated, this free space is never used. Furthermore, each I/O reads less data. Setting SLACK to 0 ensures that all available space is used for data. This parameter is specified in the FUP RELOAD utility.
Reorganizing Tables
For a database table to deliver good query scan performance, it must be reorganized periodically.
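A reload command combining both points above: rewriting the partition defragments it, and SLACK 0 packs the blocks fully. The Guardian file name here is hypothetical; the catalog query shown earlier generates the real names:

FUP RELOAD $DATA01.ZSDAB123.T1234500, SLACK 0, RATE 40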
appear again for processing. You may want to place the row forward into the selection path, but be aware that doing so can cause erratic cursor behavior and is hard to trace. nohup (no hangup) is the OSS option that best represents the Guardian NOWAIT option. For more information about this option, see the Open System Services Shell and Utilities Reference Manual. To find out if the file is full, issue the SHOWLABEL tablename command to obtain file extent information and the percentage of the table that is full.
Updating Statistics
Statistics are the single most important topic in database management. All tables, even if they are empty, need to have their statistics generated. Statistics influence almost every aspect of the query plans generated by the optimizer. You cannot update statistics on system metadata tables, including tables residing in the DEFINITION_SCHEMA, MXCS_SCHEMA, SYSTEM_DEFAULTS_SCHEMA, and SYSTEM_SCHEMA.
This example runs the UPDATE STATISTICS statement on a large table:
UPDATE STATISTICS FOR TABLE tablename SAMPLE RANDOM 10 PERCENT SET ROWCOUNT rowcount;
The UPDATE STATISTICS statement is a processor-intensive operation. To limit the impact on the system, execute this statement at a low priority so the system can manage the mixed workloads. By default, UPDATE STATISTICS ON EVERY KEY gathers statistics for all key columns, including indexed columns.
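To limit the work to specific column groups rather than every key column, the statement can name the groups explicitly. The table and column names here are illustrative:

UPDATE STATISTICS FOR TABLE sales.fact_txn
ON (store_id), (txn_date, store_id)   -- a single-column group and a multicolumn group
SAMPLE RANDOM 10 PERCENT;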
Publish and Subscribe
This section compares the use of publish and subscribe against the use of queue files.
• SKIP CONFLICT ACCESS is available through publish and subscribe, but not through queue files, which is a performance advantage that supports scalability and avoids hotspots.
• Queue files must be accessed through Enscribe, but publish and subscribe can be accessed through ODBC, JDBC, and embedded SQL, and is also supported through JMS.
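A minimal sketch of a consumer reading through the publish-and-subscribe interface. The table and column names are hypothetical, and in practice this runs as an embedded cursor rather than an interactive query:

SELECT order_id, payload
FROM STREAM(order_queue)        -- stream access: waits for newly published rows
FOR SKIP CONFLICT ACCESS;       -- skips rows locked by other consumers instead of waiting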
Implementing an ODBC or JDBC Architecture
For a detailed discussion of various considerations for managing the ODBC and JDBC environment, refer to the SQL/MX Installation and Management Guide. Hardware and software requirements for the JDBC Driver for NonStop SQL/MX are described in the softdoc file on the NonStop Server for Java product CD, with which JDBC/MX is delivered. Read that document before installing the product.
Performance Tuning
Improving Performance
This section discusses some of the common performance problems associated with large databases. Additional information exists in other manuals and is not repeated here. Specifically, the SQL/MX Query Guide contains several chapters dedicated to analyzing and improving query and database performance, and the SQL/MX Installation and Management Guide contains chapters on measuring and enhancing performance.
The optimizer chooses specific access plans in subsequent steps of the query based upon the number of rows expected from earlier steps. When the difference between the predicted and actual values is as great as several million, the subsequent query steps often cannot deliver adequate performance. In cases of severe data skewing, the optimizer has little chance of producing a fair plan.
Queries that frequently read large tables and aggregate the results to a high level benefit from the creation of one or more summary tables. The “Design Guidelines and Considerations” section of this document includes additional considerations for summary tables. Joins between large tables can be quite expensive, in terms of both processing time and resources. This is often one of the most significant issues that customers face in the database environment.
Analysis Methodology
Methodically follow these steps as a guide for improving query performance:
• Identify several key queries suffering poor performance, and use them as a test suite to evaluate the effectiveness of the changes discussed later in this document. Obtain current DISPLAY_EXPLAIN output for each query.
• For ODBC Server clients, when predicates for date columns are present in the query, replace them with parameters (“?p”) to mimic the actions of the ODBC Server subsystem.
• Determine whether any query plans are being forced. You can use the DISPLAY USE OF command for programs, but that command does not list MXCI, ODBC, or other dynamic queries that may be forced. Be careful if you use CONTROL QUERY SHAPE because doing so could cause more problems in the long run for database maintenance and performance. Using CONTROL QUERY SHAPE forces the same plan even if the database changes. If the database changes, a more efficient plan may be available through normal use of the compiler without CONTROL QUERY SHAPE.
Contributors to this Document
• Doug Tully. Provided general information about SQL/MX environments. Acting coordinator of the Best Practices content providers for NonStop SQL/MX.
• Paul Denzinger. Provided information about ways to partition tables and general information about SQL data warehouses. Acting coordinator of the Best Practices content providers for NonStop SQL/MP.
• Gary Grosch. Author of the “Solution Factory Data Specification” document referenced here.1
• Todd Gunn.
This and certain other documents referenced in the current document are not yet posted on the Web. Some are available through the HP NonStop Technical Library (NTL) product. If the author’s name is listed, please contact the author at First_name.Last_Name@HP.com for more information.