VAX 6000 Series Vector Processor Programmer’s Guide Order Number: EK–60VAA–PG–001 This manual is intended for system and application programmers writing programs for the VAX 6000 system with a vector processor.
First Printing, June 1990 The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document. Any software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license.
Contents

PREFACE ix

CHAPTER 1  VECTOR PROCESSING CONCEPTS 1–1
1.1  SCALAR VS. VECTOR PROCESSING 1–2
1.1.1  Vector Processor Defined 1–3
1.1.2  Vector Operations 1–3
1.1.3  Vector Processor Advantages 1–6
1.2  TYPES OF VECTOR PROCESSORS 1–6
1.2.1  Attached vs. Integrated Vector Processors 1–6
1.2.2  Memory vs. Register Integrated Vector Processors 1–8
1.3  VECTORIZING COMPILERS 1–8
1.4  VECTOR REGISTERS 1–9
1.5  PIPELINING 1–11
1.6  STRIPMINING 1–14
1.7  STRIDE 1–15
1.
1.10.3  Crossover Point 1–22

CHAPTER 2  VAX 6000 SERIES VECTOR PROCESSOR 2–1
2.1  OVERVIEW 2–2
2.2  BLOCK DIAGRAM 2–3
2.3  VECTOR CONTROL UNIT 2–5
2.4  ARITHMETIC UNIT 2–5
2.4.1  Vector Register File Chip 2–6
2.4.2  Vector Floating-Point Unit Chip 2–7
2.5  LOAD/STORE UNIT 2–7
2.6  VECTOR PROCESSOR REGISTERS 2–9
2.7  MEMORY MANAGEMENT
2.7.1  Translation-Not-Valid Fault
2.7.2  Modify Flows
2.7.3  Memory Management Fault Priorities
2.7.4  Address Space Translation
2.7.
CHAPTER 3  OPTIMIZING WITH MACRO-32 3–1
3.1  VECTORIZATION 3–2
3.1.1  Using Vectorization Alone 3–2
3.1.2  Combining Decomposition with Vectorization 3–3
3.1.3  Algorithms 3–5
3.2  CROSSOVER POINT 3–5
3.3  SCALAR/VECTOR SYNCHRONIZATION 3–6
3.3.1  Scalar/Vector Instruction Synchronization (SYNC) 3–6
3.3.2  Scalar/Vector Memory Synchronization 3–7
3.3.2.1  Memory Instruction Synchronization (MSYNC) 3–8
3.3.2.2  Memory Activity Completion Synchronization (VMAC) 3–9
3.3.
3.10  REGISTER REUSE 3–25

APPENDIX A  ALGORITHM OPTIMIZATION EXAMPLES A–1
A.1  EQUATION SOLVERS A–2
A.2  SIGNAL PROCESSING—FAST FOURIER TRANSFORMS
A.2.1  Optimized One-Dimensional Fast Fourier Transforms
A.2.
FIGURES
1–1  Scalar vs. Vector Processing
1–2  Vector Registers
1–3  Vector Function Units 1–11
1–4  Pipelining a Process 1–12
1–5  Constant-Strided Vectors in Memory 1–16
1–6  Random-Strided Vectors in Memory 1–16
1–7  Vector Gather and Scatter Instructions 1–17
1–8  Computer Performance Dominated by Slowest Process 1–20
1–9  Computer Performance vs.
TABLES
2–1  Memory Management Fault Prioritization 2–12
3–1  Qualifier Combinations for Parallel Vector Processing 3–3
Preface Intended Audience This manual is for the system or application programmer of a VAX 6000 system with a vector processor. Document Structure This manual has three chapters and an appendix, as follows: • Chapter 1, Vector Processing Concepts, describes basic vector concepts and how vector processing differs from scalar processing. • Chapter 2, VAX 6000 Series Vector Processor, gives an overview of the vector coprocessor and related vector features.
VAX 6000 Series Documents

Documents in the VAX 6000 series documentation set include:

Title                                                    Order Number
VAX 6000–400 Installation Guide                          EK–640EA–IN
VAX 6000–400 Owner’s Manual                              EK–640EA–OM
VAX 6000–400 Mini-Reference                              EK–640EA–HR
VAX 6000–400 System Technical User’s Guide               EK–640EB–TM
VAX 6000–400 Options and Maintenance                     EK–640EB–MG
VAX 6000 Series Upgrade Manual                           EK–600EB–UP
VAX 6000 Series Vector Processor Owner’s Manual          EK–60VAA–OM
VAX 6000 Series Vector Processor Programmer’s Guide      EK–60VAA–PG
Title                                                    Order Number
SC008 Star Coupler User’s Guide                          EK–SC008–UG
TK70 Streaming Tape Drive Owner’s Manual                 EK–OTK70–OM
TU81/TA81 and TU81 PLUS Subsystem User’s Guide           EK–TUA81–UG
ULTRIX–32 Guide to System Exercisers                     AA–KS95B–TE
VAX Architecture Reference Manual                        EY–3459E–DP
VAX FORTRAN Performance Guide                            AA–PB75A–TE
VAX Systems Hardware Handbook — VAXBI Systems            EB–31692–46
VAX Vector Processing Handbook                           EC–H0419–46
VAXBI Expander Cabinet Installation Guide                EK–VBIEA–IN
VAXBI Options Ha
1 Vector Processing Concepts This chapter presents a brief overview of vector processing concepts. Sections include: • Scalar vs.
Vector Processing Concepts 1.1 SCALAR VS. VECTOR PROCESSING Vector processing is a way to increase computer performance over that of a general-purpose computer for certain scientific applications. These include image processing, weather forecasting, and other applications that involve repeated operations on groups, or arrays, of elements. A vector processor is a computer optimized to execute the same instruction repeatedly. For example, consider the process of adding 50 to a set of 100 numbers.
1.1.1 Vector Processor Defined

A vector processor is a computer that operates on an entire vector with a single vector instruction. These types of computers are known as single-instruction/multiple-data (SIMD) computers because a single vector instruction can process a stream of data. A traditional scalar computer typically operates only on scalar values and must therefore process vectors sequentially.
A scalar processor operates on single quantities of data. Consider the following operation: A(I) = B(I) + 50. As illustrated in Figure 1–1, five separate instructions must be performed, using one instruction per unit of time, for each value from 1 to Imax (some CPUs may combine steps and use fewer units of time):

1. Load first element from location B.
2. Add 50 to the first element.
3. Store the result in location A.
4. Increment the counter.
5. Branch back to step 1 if the counter has not passed Imax.
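The contrast between the two styles can be sketched in a few lines. The following Python fragment is illustrative only (the function names and the 64-element register limit are assumptions, not DEC code): it performs the chapter's "add 50 to a set of 100 numbers" task element by element, as a scalar CPU would, and in register-sized chunks, as a vector processor conceptually does.

```python
def scalar_add(data, k):
    # Scalar style: one load/add/store sequence per element.
    result = []
    for x in data:
        result.append(x + k)
    return result

def vector_add(data, k, vlen=64):
    # Vector style: operate on up to vlen elements per instruction.
    result = []
    for i in range(0, len(data), vlen):
        chunk = data[i:i + vlen]              # one vector load
        result.extend(x + k for x in chunk)   # one vector add and store
    return result
```

Both produce the same answer; the difference is that the vector version touches memory in whole-register groups rather than one value at a time.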
Figure 1–1 Scalar vs. Vector Processing
1.1.3 Vector Processor Advantages

Vector processors have the following advantages:

• A vector processor can use the full bandwidth of the memory system for loading and storing an array. Unlike the scalar processor, which accepts single values at a time from memory, the vector processor accepts any number of elements up to its limit, say 64, at a time. The vector processor processes all the elements together and returns them to memory.
Vector Processing Concepts I/O device, controlled under program direction through special registers and operating asynchronously from the host. Program data is moved back and forth between the attached processor and the host with standard I/O operations. The host processor requires no special internal hardware to use an attached vector processor. There is no "pairing" of a host processor to an attached vector processor. A system can have multiple host scalar processors and one attached vector processor.
ability to overlap vector and scalar operations can give better performance than those that do not.

1.2.2 Memory vs. Register Integrated Vector Processors

There are two types of integrated vector processor architectures: memory-to-memory and register-to-register. In a memory-to-memory architecture, vector data is fetched directly from memory into the function units of the vector processing unit. Once the data is operated on, the results are returned directly to memory.
Vector Processing Concepts for which vectorization will yield correct results and generate vector instructions for those operations. If the compiler cannot be certain that a particular expression can be correctly vectorized, it will not vectorize that portion of the source code. The vectorizing compiler can reorganize sections of the program code (usually inside formal loops) that can be vectorized. Certain portions of all applications are nonvectorizable.
Figure 1–2 Vector Registers

[Figure: each vector register holds 64 elements of 64 bits each (<63:0>). The Vector Length Register (VLR) controls the number of elements used; the Vector Mask Register (VMR) enables or disables the use of individual elements.]
Vector Processing Concepts 1.5 PIPELINING A vector function unit, or pipe, performs a specific function within a vector processor and operates independently of other function units. Some vector processors have three function units (see Figure 1–3): one for memory instructions, one for operations (such as add, subtract, and multiply), and one for miscellaneous instructions. Some vector processors have additional function units, specifically to perform additional arithmetic functions.
Vector Processing Concepts Figure 1–4 shows a concurrent execution of a process divided into four subprocesses. Each subprocess has five operations (A, B, C, D, and E), which start at different times. Operation A might be a load, operation B might be an add, ... and operation E might be a store. There is some overhead in starting the process, but once that overhead (or pipeline latency) is paid, the function unit produces one result per cycle.
Vector Processing Concepts Because most arithmetic and memory operations can be broken down into a series of one-cycle steps, the function units of a vector processor are generally pipelined. Thus, after initial pipeline latency, the function units can process an entire vector in the number of cycles equal to the length of the input vector—one vector element result per cycle. This time interval (known as a chime) is approximately equal (in cycles) to the length of the vector plus the pipeline latency.
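The chime estimate above can be expressed as a simple formula. The helper names in this Python sketch are illustrative, not the manual's terminology; it restates "vector length plus pipeline latency" and contrasts it with an unpipelined unit that pays the full latency for every element.

```python
def chime_cycles(vector_length, pipeline_latency):
    # Pipelined: after the startup latency, one result per cycle.
    return pipeline_latency + vector_length

def unpipelined_cycles(vector_length, pipeline_latency):
    # Without pipelining, every element pays the full latency.
    return vector_length * pipeline_latency
```

For a 64-element vector and a 5-cycle pipe, the pipelined unit finishes in 69 cycles instead of 320.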
Vector Processing Concepts 1.6 STRIPMINING An array longer than the maximum vector register length of a vector processor must be split into two or more subarrays or vectors, each of which can fit into a vector register. This procedure is known as stripmining (or sectioning), and it is performed automatically by a vectorizing compiler when the source program operates on loops longer than the maximum vector length.
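A vectorizing compiler's stripmining step can be modeled as splitting an n-element loop into strips of at most the maximum vector length. The following Python sketch is a hypothetical illustration (assuming a 64-element register, as on this machine); it yields the (start, length) pairs the generated code would iterate over.

```python
def stripmine(n, max_vl=64):
    # Split an n-element loop into (start, length) strips, each of
    # which fits in a vector register of max_vl elements.
    strips = []
    start = 0
    while start < n:
        length = min(max_vl, n - start)
        strips.append((start, length))
        start += length
    return strips
```

A 150-element array, for example, becomes two full 64-element strips plus a final 22-element strip.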
Vector Processing Concepts 1.7 STRIDE To a vector processor, a vector in memory is characterized by a start location, a length, and a stride. Stride is a measure of the number of memory locations between consecutive elements of a vector. A stride equal to the size of one vector element, known as unity stride, means that the vector elements are contiguous in memory.
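The start/length/stride characterization maps directly to element addresses. In this illustrative Python sketch (the function name is an assumption), the i-th element of a constant-strided vector lives at base + i * stride; a stride equal to the element size gives contiguous, unity-stride addresses.

```python
def element_addresses(base, count, stride):
    # A vector in memory is described by a start location, a length,
    # and a stride (bytes between consecutive elements).
    return [base + i * stride for i in range(count)]
```

With 4-byte longword elements, a stride of 4 is unity stride; a stride of 8 skips every other longword.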
Figure 1–5 Constant-Strided Vectors in Memory

[Figure: vector A occupies consecutive memory locations (stride = 1); vector B occupies every other location (stride = 2).]

Figure 1–6 Random-Strided Vectors in Memory

[Figure: each element is located at the base address plus a per-element offset (stride = random).]
Vector Processing Concepts 1.8 GATHER AND SCATTER INSTRUCTIONS To support random-strided vectors, gather and scatter instructions operate under control of a vector register that contains an index vector. For each element in the vector, the corresponding element in the index vector contains an offset from the start location of the vector in memory.
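The index-vector mechanism can be sketched in a few lines. The following Python functions are a conceptual model, not the vector gather/scatter instructions themselves: each element's memory location is the vector's start location plus the corresponding offset from the index vector.

```python
def gather(memory, base, offsets):
    # Gather: element i is read from memory[base + offsets[i]].
    return [memory[base + off] for off in offsets]

def scatter(memory, base, offsets, values):
    # Scatter: element i is written to memory[base + offsets[i]].
    for off, val in zip(offsets, values):
        memory[base + off] = val
```

This is how a random-strided vector is handled: the index vector supplies the irregular offsets that a constant stride cannot express.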
Vector Processing Concepts 1.9 COMBINING VECTOR OPERATIONS TO IMPROVE EFFICIENCY Some of the techniques available to increase vector instruction efficiency include overlapping and chaining. 1.9.1 Instruction Overlap Overlapping instructions involves combining two or more instructions to overlap their execution to save execution time. If a vector processor has independent function units, it can perform different operations on different operands simultaneously.
Vector Processing Concepts 1.10 PERFORMANCE The performance of scalar computers has been measured for some time using millions of instructions executed per second (MIPS). MIPS is not a good measure of speed for vector processors, since one instruction produces many results. Vector processor execution speed, instead, is measured in millions of floating-point operations per second (MFLOPS). Other abbreviations used are MegaFLOPS, MOPS (millions of operations per second), and RPM (results per microsecond).
Figure 1–8 Computer Performance Dominated by Slowest Process

[Figure: execution-time breakdown of a program whose operations are 30% scalar and 70% vector.]
Vector Processing Concepts 1.10.2 Vectorization Factor Some computer programs use only a portion of the code during the majority of the execution time. For example, a program may spend most of its time doing mathematical calculations, which comprise only 20% of its code. The vectorization factor may be defined as the percentage of the original scalar execution time that may be vectorized. Figure 1–9 shows how the performance increases as more code is vectorized. Figure 1–9 Computer Performance vs.
Consider a scalar program that uses 20% of its code 70% of the time. If this 20% portion of the code is converted to vector processing code, the program is considered to have a vectorization factor of 70%. If the time for a scalar operation is set to 1 and the time for a vector operation is 10% of that, we have:

T = N * (.30 * 1 + .70 * .1)
T = N * .37

If performance (P) equals operations performed (N) per unit time (T), then, with T = N * .37:

P = N / T = N / (N * .37) = 1 / .37 ≈ 2.7

That is, vectorizing this program improves its performance by a factor of about 2.7.
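This arithmetic generalizes to any vectorization factor. The Python sketch below (the helper name is an assumption) computes the overall speedup from the fraction of time that is vectorized and the relative cost of a vector operation.

```python
def vector_speedup(vector_fraction, vector_cost):
    # Relative time per operation: the scalar part runs at cost 1,
    # the vectorized fraction runs at vector_cost (e.g. 0.1).
    t = (1.0 - vector_fraction) * 1.0 + vector_fraction * vector_cost
    return 1.0 / t
```

With the numbers above, vector_speedup(0.70, 0.10) reproduces the 1/.37, roughly 2.7-fold, improvement; note how little a fully scalar residue caps the gain even when the vector unit is very fast.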
2 VAX 6000 Series Vector Processor This chapter describes the vector processor module for the VAX 6000 series. The basic hardware is briefly described and then the hardware components are discussed from the software perspective.
VAX 6000 Series Vector Processor 2.1 OVERVIEW The FV64A vector processor is a single-board option that implements the VAX vector instruction set. This module requires a scalar CPU module for operation. The scalar/vector pair implement the VAX instruction set plus the VAX vector instructions. Figure 2–1 is a block diagram of the scalar/vector pair. The vector processor occupies a slot adjacent to the scalar CPU on the XMI. The two processors are connected by the vector interface bus (VIB) cable.
Figure 2–1 Scalar/Vector Pair Block Diagram

[Figure: the scalar processor (CPU chip, C-chip, cache, F-chip, DAL, XMI interface) connects over the VIB cable to the vector processor (vector control unit with VECTL chip, arithmetic pipelines, cache data (CD) bus, cache, duplicate tag store, and load/store with XMI interface); both modules attach to the XMI bus.]
The FV64A chipset consists of five core chips, as follows:

• Vector instruction issue and scalar/vector interface chip
• Vector register file chip (4 chips)
• Vector arithmetic data path, floating-point unit (FPU) chip (4 chips)
• Load/Store—Vector module translation buffer, cache, and XMI interface controller chip
• Clock generation chip (same as on scalar module)

Figure 2–2 FV64A Vector Processor Block Diagram

[Figure: the vector control unit connects to the scalar processor over the VIB cable.]
VAX 6000 Series Vector Processor 2.3 VECTOR CONTROL UNIT When the vector control unit receives instructions, it buffers the instructions and controls instruction issue to other functional units in the vector module. The vector control unit is responsible for all scalar/vector communication. The vector control unit also contains the necessary register scoreboarding to control instruction overlap. The scoreboard implements the algorithms that permit chaining of arithmetic operations into store operations.
VAX 6000 Series Vector Processor operation to take place. The data from the register file chip flows to the vector FPU chip. Input data to the vector FPU chip comes over a 32-bit bus that is driven twice per cycle, and results are returned on a separate 32-bit bus that is driven once per cycle. The two operands for single-precision instructions can be passed in one cycle, while double-precision operands require two cycles.
VAX 6000 Series Vector Processor 2.4.2 Vector Floating-Point Unit Chip The FPU chip is a multi-stage pipelined floating-point processor. Among its features are: 2.5 • VAX vector floating-point instructions and data types. The FPU implements instruction and data type support for all VAX vector floating-point instructions as well as the integer multiply operation. Floating-point data types F_, D_, and G_floating are supported. • High-throughput external interface.
VAX 6000 Series Vector Processor corresponding register file address is presented to the four register file chips. The data and addresses are automatically aligned for load and store operations to permit the correct reading and writing of the register file and cache data RAMs. Up to four cache misses can be outstanding before the read data for the first miss returns, and hits can be taken under misses.
VAX 6000 Series Vector Processor 2.6 VECTOR PROCESSOR REGISTERS The vector processor has 16 data registers, each containing 64 elements numbered 0 through 63. Each element is 64 bits wide. A vector instruction that reads or writes longwords of F_floating or integer data reads bits <31:0> of each source element and writes bits <31:0> of each destination element. Other registers used with the data registers are the Vector Length, Vector Count, and Vector Mask Registers (see Figure 2–3).
Figure 2–3 Vector Count, Vector Length, Vector Mask, and Vector Registers

[Figure: Vector Count (VC) and Vector Length (VL) are 7-bit registers (<6:0>). The 64-bit Vector Mask is accessed as VML (<31:0>) and VMH (<63:32>). Each vector register Vn holds 64 quadword elements, Vn[0:63].]
2.7 MEMORY MANAGEMENT

The vector processor implements memory management as described in the VAX Architecture Reference Manual. The 32-bit virtual address is partitioned as shown in Figure 2–4.

Figure 2–4 Virtual Address Format

[Figure: bits <31:30> select the address space (00 = P0 space, 01 = P1 space, 10 = S space, 11 = reserved; such a virtual address causes a length violation); bits <29:9> hold the virtual page number; bits <8:0> hold the byte in page.]

2.7.
VAX 6000 Series Vector Processor 2.7.3 Memory Management Fault Priorities Table 2–1 shows the priority order, from highest to lowest, by which the vector processor reports faults. Table 2–1 Memory Management Fault Prioritization 2.7.
2.8 CACHE MEMORY

The vector module implements a single-level, direct-mapped cache. In addition, the load/store unit can hold the data and addresses for one complete vector store or scatter instruction. Figure 2–5 shows the flow of address and data in the load/store pipeline. Each stage takes one or more cycles of the 44.44-ns vector module clock. The XMI stage is a multiple of 64 ns, and the time taken depends on the transaction type and the XMI bus activity.
VAX 6000 Series Vector Processor Associated with each of the 32K main tags is a duplicate tag in the XMI interface. This tag is allocated in parallel with the main tag and is used for determining invalidates. All XMI write command/address cycles are compared with the duplicate tag data to determine if an invalidate should take place. The resulting invalidate is placed in a queue for subsequent processing in the main tag store. Figure 2–6 shows the cache arrangement.
VAX 6000 Series Vector Processor Figure 2–8 shows how the main tag memory is arranged. The main tag is written with PA<28:20>, and the valid bit covers a hexword block. The parity bit covers the tag and valid bits. The duplicate tag memory is identical to the main tag memory. Figure 2–9 shows the organization of the cache data. Each cache block contains four quadwords, with eight longword parity bits.
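The tag layout lends itself to a small address-splitting sketch. Note the assumptions: the text states only that the tag holds PA<28:20> and that a valid bit covers a 32-byte hexword block; the index and offset fields below are inferred from the 32K tag count, so treat this as an illustration rather than a documented layout.

```python
def cache_fields(pa):
    # Stated: the main tag stores PA<28:20>; a valid bit covers a
    # 32-byte hexword block. Inferred (assumption): with 32K tags,
    # the index is PA<19:5> and the byte offset is PA<4:0>.
    offset = pa & 0x1F            # PA<4:0>: byte within hexword block
    index = (pa >> 5) & 0x7FFF    # PA<19:5>: one of 32K tag entries
    tag = (pa >> 20) & 0x1FF      # PA<28:20>: value held in the tag RAM
    return tag, index, offset
```

Under these assumptions, two physical addresses 1 Mbyte apart map to the same index with different tags, which is the direct-mapped conflict case.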
VAX 6000 Series Vector Processor 2.8.2 Cache Coherency All data cached by a processor must remain coherent with data in main memory. This means that any write done by a processor or I/O device must displace data cached by all processors. The XMI interface in the load/store unit continuously monitors all XMI write transactions. When a write is detected, the address is compared with the contents of a duplicate tag store to determine if the write should displace data in the main cache.
VAX 6000 Series Vector Processor 2.9 VECTOR PIPELINING The vector processor, which is fully pipelined, has three major function units: the vector issue unit, the load/store unit, and the arithmetic unit (see Figure 2–10). These function units operate independently and are fully pipelined. Vector instructions are received by the issue unit from the scalar processor.
VAX 6000 Series Vector Processor The availability of registers is handled by a method called scoreboarding. The rules governing register availability depend on the type of instruction to be issued. • For a load instruction, the register to be loaded must not be modified by any currently executing (arithmetic) instruction, and it must not be modified or used as input by any currently deferred (arithmetic) instruction.
VAX 6000 Series Vector Processor Once a load or store (or scatter or gather) instruction is issued, no further instructions may be issued until a Memory Management Okay (MMOK) is received. The scalar unit is also stalled until the MMOK is received. Chapter 3 suggests certain coding techniques to minimize the impact of this behavior as much as possible. 2.9.3 Arithmetic Unit The arithmetic unit is composed of two parts: the vector ALU and the vector FPU.
Figure 2–11 Vector Arithmetic Unit

[Figure: an integer instruction passes through a one-cycle conversion stage into the FPU pipeline stages, followed by a one-cycle result conversion; a separate path handles logical or merge instructions.]

An instruction continues executing until all results are completed.
VAX 6000 Series Vector Processor 2.10 INSTRUCTION EXECUTION The vector pipeline is made up of a varying number of segments depending on the type of instruction being executed. Once an instruction is issued, the pipeline is under the control of the load/store unit or the arithmetic unit. The interaction between the different function units of the vector module can greatly affect the performance/execution of vector instructions.
3 Optimizing with MACRO-32 This chapter discusses optimization features of the VAX 6000 series vector processor. Appendix A provides additional optimization examples.
3.1 VECTORIZATION

Many loops that VAX FORTRAN decomposes can also be vectorized. VAX FORTRAN performs vectorization automatically whenever /VECTOR is specified at compilation. VAX FORTRAN can vectorize any FORTRAN–77 Standard-conforming program, and vectorized programs can freely call and be called by other programs (vectorized, not vectorized, and non-FORTRAN) as long as both abide by the VAX Calling Standard.
– Combine vectorization with decomposition.
– Consider a solution at a higher level hierarchy.
– Retest the program, compiling with /VECTOR (or any combination with a parallel qualifier), and return to the start of step 3.

3.1.2 Combining Decomposition with Vectorization

To produce code that executes in parallel and on vector processors, compile /VECTOR with a parallel qualifier. Table 3–1 lists the compilation combinations and their interpretations.
Optimizing with MACRO-32 aggregate speedup of each because of these qualities; however, both CPU time and wall-clock time can be reduced most dramatically when vector and parallel processing are combined. The qualities involved are as follows: • Large amounts of vector load-stores can create a bottleneck in the system. On the other hand, small amounts of CPU work can cause the parallel processing startup overhead itself to become a bottleneck.
Optimizing with MACRO-32 3.1.3 Algorithms At times it is necessary to consider the algorithm that is represented by the code to be optimized. Some algorithms are not as well suited to vectorization as others. It may be more effective to change the algorithm used or the way it is implemented rather than trying to optimize the existing code.
Optimizing with MACRO-32 3.3 SCALAR/VECTOR SYNCHRONIZATION For most cases, it is desirable for a vector processor to operate concurrently with the scalar processor so as to achieve best performance. However, there are cases where the operation of the vector and scalar processors must be synchronized to ensure correct program results.
Optimizing with MACRO-32 serviced, the SYNC may be returned to through a Return from Exception or Interrupt (REI) instruction. SYNC only affects the scalar/vector processor pair that executed it. It has no effect on other processors in a multiprocessor system. 3.3.
Optimizing with MACRO-32 Scalar/vector memory synchronization does not mean that previously issued vector memory instructions have completed; it only means that the vector and scalar processor are no longer performing memory operations. While both VMAC and MSYNC provide scalar/vector memory synchronization, MSYNC performs significantly more than just that function. In addition, VMAC and MSYNC differ in their exception behavior.
Optimizing with MACRO-32 fault. After the fault has been serviced, the MSYNC may be returned to through a Return from Exception or Interrupt (REI) instruction. 3.3.2.2 Memory Activity Completion Synchronization (VMAC) Privileged software needs a way to ensure scalar/vector memory synchronization that will not result in any exceptions being reported.
Optimizing with MACRO-32 upon) the completion of all prior conflicting accesses of that location by previous vector memory instructions. VSYNC does not have any synchronizing effect between scalar and vector memory access instructions. VSYNC also has no synchronizing effect between vector load instructions because multiple load accesses cannot conflict. It also does not ensure that previous vector memory management exceptions are reported to the scalar processor. 3.3.
3.3.4.2 Precise Exceptions

The vector processor produces precise exceptions for memory management faults. When a memory management exception occurs, microcode and operating system handlers are used to fix the exception. The vector processor cannot service Translation Not Valid and Access Control Violation faults. To handle these exceptions, the vector processor passes sufficient state data back to the scalar CPU.
startup latency for the second arithmetic instruction (deferred arithmetic instruction) is a benefit in algorithms that require less than eight bytes/FLOP of load/store bandwidth. Typical algorithms benefit greatly from the ability to chain an arithmetic operation into a store operation. The vector control unit, along with the ALU unit, implements this capability. The following sections describe, by instruction type, the flow of instructions in the machine.

3.4.
In the load/store instruction, the Vector Length Register (VLR) and the Vector Mask Register (VMR), with the match true/false (T/F) sense (when the mask operate enable (MOE) bit is set), determine which elements to access in Vc. For longwords, only bits <31:0> may be accessed. The elements can be loaded or stored out of order because there can be multiple load/store units and multiple paths to memory, a desirable property for vector processors.
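The element-selection rule (VLR bounds the operation, and with MOE set each element's VMR bit must match the T/F sense) can be sketched as follows. This Python model is purely illustrative, with names of our choosing.

```python
def masked_elements(vlr, vmr_bits, match_true=True):
    # Only elements 0..VLR-1 participate; with MOE set, an element is
    # accessed only if its VMR bit matches the instruction's T/F sense.
    return [i for i in range(vlr) if bool(vmr_bits[i]) == match_true]
```

For example, with VLR = 5 and mask bits 1,0,1,1,0,..., a match-true operation touches elements 0, 2, and 3, while the match-false form touches 1 and 4.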
Optimizing with MACRO-32 The data from a store instruction is internally buffered in the chip. This offers the advantage of allowing cache hit load instructions to issue and complete while the write executes over the XMI. 3.4.3 Memory Management Okay (MMOK) When a memory reference occurs, control is turned over to memory management until an MMOK is returned indicating that all memory references were successful.
Optimizing with MACRO-32 When a gather or scatter instruction is received by the vector control unit, the destination/source register is checked against outstanding arithmetic instructions. If there are no conflicts, the instruction is dispatched to the load/store unit. The load/store unit will then fetch the offset vector register.
Optimizing with MACRO-32 In the following examples, an I represents instruction issue time and an E represents instruction execution time. A series of periods represents wait time in the arithmetic unit for deferred instructions. Notice that these are not exact timing examples, since they do not correspond to individual instruction timings, but are for illustration purposes only.
Example 3–2 Maximizing Instruction Execution Overlap

Instruction Sequence 1

VLDL    base1,#4,V1    IEEEEEEEEE
VLDL    base2,#4,V2    IEEEEEEEEE
VVMULL  V3,V1,V1       IEEEEE
VVADDL  V1,V2,V2       I....EEEEE
VSTL    V2,base,#4     IEEEEEEEEE

Instruction Sequence 2

VLDL    base1,#4,V1    IEEEEEEEEE
VVMULL  V3,V1,V1       IEEEEE
VLDL    base2,#4,V2    IEEEEEEEEE
VVADDL  V1,V2,V2       IEEEEE
VSTL    V2,base,#4     IEEEEEEEEE

A load instruction cannot begin execution until the register to which it will write is free.
Example 3–3 Effects of Register Conflict

Instruction Sequence 1

VVADDL  V1,V2,V3     IEEEEE
VVMULL  V1,V2,V4     I....EEEEE
VLDL    base,#4,V1   IEEEEEEEEE

Instruction Sequence 2

VVADDL  V1,V2,V3     IEEEEE
VVMULL  V1,V2,V4     I....EEEEE
VLDL    base,#4,V5   IEEEEEEEEE

Nonunity stride loads and stores can have a significantly higher impact on the performance level of the XMI memory bus as compared to unity stride operations.
Example 3–4 Deferred Arithmetic Instruction Queue

Instruction Sequence

VVADDL  V1,V2,V3
VVMULL  V3,V1,V4
VLDL    base,#4,V2

Execution without Deferred Instruction Queue

Issue VVADDL           IEEEEEEEE
Issue VVMULL           IEEEEEEEE
Issue VLDL             IEEEEEEEEEEEEEE

Execution with Deferred Instruction Queue

Issue VVADDL           IEEEEEEEE
Issue deferred VVMULL  I.......
Issue VLDL
Optimizing with MACRO-32 Example 3–6 is another example of the use of a deferred arithmetic instruction. In this case, a divide instruction is followed by an add and then a load. The deferred instruction queue and the length of the divide instruction combine to "hide" the load instruction (that is, the execution time of the load instruction does not contribute to the total execution time of the instruction sequence). Note also that the divide instruction completes after the load completes.
In Example 3–7, the VSTL instruction requires the result of the VVADDL instruction; without chain into store, it would have to wait for the VVADDL to complete before beginning the store operation. The use of chain into store allows the VSTL operation to begin after the first result of the add is complete, while the VVADDL is still executing, and greater overlap of instruction execution is the result. The instruction sequence requires a shorter period of time to complete.
Optimizing with MACRO-32 A cache miss is serviced by a hexword fill. On the XMI, a hexword transfer is 80 percent efficient since one address is sent to receive four quadwords of data. An octaword transfer is 67 percent efficient since one address is sent to receive two quadwords of data. A quadword transfer is only 50 percent efficient since one address is sent to receive one quadword of data. For this reason, stores are more efficient with unity stride than with nonunity stride.
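The quoted efficiencies all follow from one observation: each XMI transfer spends one address cycle plus one data cycle per quadword. A small illustrative Python helper (the name is our own) makes the pattern explicit.

```python
def xmi_efficiency(data_quadwords):
    # One address cycle per transfer, then one cycle per quadword of
    # data: efficiency = data cycles / total cycles.
    return data_quadwords / (data_quadwords + 1)
```

xmi_efficiency(4), xmi_efficiency(2), and xmi_efficiency(1) reproduce the 80, 67, and 50 percent figures for hexword, octaword, and quadword transfers, which is why unity-stride stores, able to use the larger transfers, are preferred.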
most efficient in that it runs sequentially through the data and makes full use of all PTEs fetched. An example of how to avoid large vector strides can be seen in a simple matrix multiplication problem:

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    ENDDO
  ENDDO
ENDDO

If coded as written, there is a choice of which variable to vectorize on. If the "K" variable is chosen, array A will access FORTRAN rows that are nonunity stride.
Example 3–8 Matrix Multiply—Basic

        msync   R0              ;synchronize with scalar
LOOP:   vldl    A(I,K),#4,V0    ;col of A is loaded into V0
        vsmulf  B(K,J),V0,V0    ;V0 gets the product of V0
                                ;and the scalar value B(K,J)
        vldl    C(I,J),#4,V1    ;col of C gets loaded into V1
        vvaddf  V0,V1,V1        ;V1 gets V0 summed with V1
        vstl    V1,C(I,J),#4    ;V1 is stored back into C
        INC K                   ;increment K by one
        IF (K < N) GOTO LOOP    ;Loop for all values of K
        INC J                   ;increment J by vector length
        IF (J < N) GOTO LOOP    ;Loop for all values of J
Optimizing with MACRO-32 3.10 REGISTER REUSE The concept used in Example 3–9 to reuse the data when it has already been loaded into a vector register is known as register reuse. Register reuse can be extended further by using all available vector registers to decrease the bytes/FLOP ratio and improve performance.
Example 3–10 Matrix Multiply—Optimal

        msync   R0
        mtvlr   #N
loop2:  vldl    A(I,K),#4,v0
        vsmulf  B(K,J),v0,v2
        vsmulf  B(K,J+1),v0,v3
        vsmulf  B(K,J+2),v0,v4
        vsmulf  B(K,J+3),v0,v5
        vsmulf  B(K,J+4),v0,v6
        vsmulf  B(K,J+5),v0,v7
        vsmulf  B(K,J+6),v0,v8
        vsmulf  B(K,J+7),v0,v9
        vsmulf  B(K,J+8),v0,v10
        vsmulf  B(K,J+9),v0,v11
        vsmulf  B(K,J+10),v0,v12
        vsmulf  B(K,J+11),v0,v13
        vsmulf  B(K,J+12),v0,v14
        vsmulf  B(K,J+13),v0,v15
        INC K                     ;update K
loop1:  vldl    A(I,K),#4,v0
        vsmulf  vva
Optimizing with MACRO-32 Example 3–10 (Cont.
Optimizing with MACRO-32 Example 3–10 (Cont.
Optimizing with MACRO-32 Example 3–10 (Cont.
A Algorithm Optimization Examples

This appendix illustrates how the characteristics of the VAX 6000 series vector processor can be used to build optimized routines for this system, and how an algorithm and its implementation can change the performance of an application. The VAX 6000 series vector processor delivers high performance for computationally intensive applications.
Two groups of applications with high vector processing potential are equation solvers and signal processing routines. For example, computational fluid dynamics, finite element analysis, molecular dynamics, circuit simulation, quantum chromodynamics, and economic modeling applications use various types of simultaneous or differential equation solvers.
Example A–1 Core Loop of a BLAS 1 Routine Using Vector-Vector Operations

xAXPY - computes Y(I) = Y(I) + aX(I) where x = precision = F, D, G

        MSYNC                       ;synchronize with scalar
LOOP:   VLDx    X(I),std,VR0        ;X(I) is loaded into VR0
        VSMULx  a,VR0,VR0           ;VR0 gets the product of VR0
                                    ;and the scalar value "a"
        VLDx    Y(I),std,VR1        ;Y(I) gets loaded into VR1
        VVADDx  VR0,VR1,VR1         ;VR1 gets VR0 summed with VR1
        VSTx    VR1,Y(I),std        ;VR1 is stored back into Y(I)
        INC I                       ;increment I by vector length
        IF (I < SIZ) GOTO LOOP
        MSYNC
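A scalar rendering of the xAXPY kernel is useful for checking results against the vector version (Python sketch, element by element; the vector hardware processes a whole vector-length slice per instruction):

```python
# Scalar model of the xAXPY core loop above: Y <- Y + a*X.
def axpy(a, x, y):
    """Return the element-wise result y[i] + a*x[i]."""
    return [yi + a * xi for xi, yi in zip(x, y)]

print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```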
Figure A–1 Linpack Performance Graph, Double-Precision BLAS Algorithms

Refer to the printed version of this book, EK–60VAA–PG.

For the Linpack 300 by 300 benchmark, optimizations include the use of routines that are equivalent to matrix-vector level 2 BLAS routines. Example A–2 details the core loop of a BLAS 2 routine. BLAS 2 routines make better use of cache and translation buffers than the BLAS 1 routines do.
Example A–2 Core Loop of a BLAS 2 Routine Using Matrix-Vector Operations

xGEMV - computes Y(I) = Y(I) + X(J)*M(I,J) where x = precision = F, D, G

        MSYNC                       ;synchronize with scalar
ILOOP:  VLDx    Y(I),std,VR0        ;Y(I) is loaded as VR0
JLOOP:  VLDx    M(I,J),std,VR1      ;VR1 gets columns of M(I,J)
        VSMULx  X(J),VR1,VR2        ;VR2 gets the product of VR1
                                    ;and X(J) as a scalar
        VVADDx  VR0,VR2,VR0         ;VR0 gets VR0 summed with VR2
        INC J
        IF (J < SIZ) GOTO JLOOP     ;Loop for all values of J
        VSTx    VR0,Y(I),std
Example A–3 Core Loop of a BLAS 3 Routine Using Matrix-Matrix Operations

xGEMM - computes Y(I,J) = Y(I,J) + X(I,K)*M(K,J) where x = precision = F, D, G

        MSYNC                       ;synchronize with scalar
IJLOOP: VLDx    Y(I,J),std,VR0      ;Y(1:N,J) gets loaded into VR0
KLOOP:  VLDx    M(K,J),std,VR1      ;M(1:N,J) gets loaded into VR1
        VSMULx  X(I,K),VR1,VR2      ;VR2 gets VR1 multiplied by
                                    ;X(I,K) as a scalar
        VVADDx  VR0,VR2,VR0         ;VR0 gets VR0 summed with VR2
        INC K                       ;increment K by vector length
        IF (K < SIZ) GOTO KLOOP
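The reason BLAS 3 routines use the cache and translation buffers so well can be sketched with a simple operation count: FLOPs grow as N**3 while the data touched grows only as N**2 (illustrative Python, my own accounting rather than the manual's):

```python
# Sketch: FLOPs performed per matrix element touched in an N-by-N xGEMM.
def flops_per_element(n):
    flops = 2 * n**3            # one multiply and one add per (i,j,k) triple
    elements = 3 * n**2         # the X, M, and Y matrices
    return flops / elements

print(flops_per_element(10))    # ~6.7 operations per element touched
print(flops_per_element(300))   # 200.0 -- the Linpack 300 by 300 case
```

The ratio grows linearly with N, so each element brought into the cache is reused many times before being displaced.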
by 1000 obtained a vector speedup of approximately 25, as shown in Figure A–1.

A.2 SIGNAL PROCESSING—FAST FOURIER TRANSFORMS

The Fourier transform decomposes a waveform, or more generally a collection of data, into component sine and cosine representations. The discrete Fourier transform (DFT) of a data set of length N performs the transformation following the strict mathematical definition, which requires O(N**2) floating-point operations.
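The fast Fourier transform reduces this cost to O(N*LOG2(N)) operations. A sketch of the resulting savings (Python; constant factors are omitted, so treat the ratio as indicative only):

```python
import math

# Sketch of the operation-count argument: direct DFT ~ N**2 operations,
# FFT ~ N*log2(N). Constant factors omitted; the ratio is indicative only.
def dft_ops(n):
    return n * n

def fft_ops(n):
    return n * math.log2(n)

n = 1024
print(dft_ops(n) / fft_ops(n))  # ~102x fewer operations for the FFT
```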
Figure A–2 Cooley-Tukey Butterfly Graph, One-Dimensional Fast Fourier Transform for N = 16

Refer to the printed version of this book, EK–60VAA–PG.

is performed as close to the center as possible, when the stage number = LOG(N)/2. To complete the algorithm, the butterfly distances now increase again. Second, this process entirely eliminates the need for short butterflies.
Figure A–3 Optimized Cooley-Tukey Butterfly Graph, One-Dimensional Fast Fourier Transform for N = 16

Refer to the printed version of this book, EK–60VAA–PG.

Reusing data in the vector registers also saves vector processing time. The VAX vector architecture provides 16 vector registers. If all 16 registers are used carefully, data can be reused by two successive butterfly stages without storing and reloading the data.
Figure A–4 One-Dimensional Fast Fourier Transform Performance Graph, Optimized Single-Precision Complex Transforms

Refer to the printed version of this book, EK–60VAA–PG.

Figure A–5 Two-Dimensional Fast Fourier Transforms Using N Column and N Row One-Dimensional Fast Fourier Transforms

Refer to the printed version of this book, EK–60VAA–PG.

The column access is unity stride, and the row access has a stride of the dimension of the array.
On VAX vector systems, the use of a matrix transpose can dramatically increase the vector processing performance of two-dimensional FFTs for large values of N (that is, N > 256). The difference between unity stride and nonunity stride is the key performance issue. Figure A–6 shows that a vectorized matrix transpose can be performed after each set of N one-dimensional FFTs.
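A minimal sketch of the transpose idea (Python; the real routine operates on the FFT data arrays, and the FFT passes themselves are omitted here):

```python
# Sketch of the transpose method: rather than performing row FFTs with
# nonunity stride, transpose the matrix so the "row" FFTs become column
# FFTs on the transposed data, again walking memory with unity stride.
def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

a = [[1, 2], [3, 4]]
print(transpose(a))  # [[1, 3], [2, 4]]
```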
When the value of N is relatively small (that is, N < 256), the two-dimensional FFT can be computed by calling a one-dimensional FFT of length N**2. The small two-dimensional FFT can achieve performance equal to that of the aggregate-size one-dimensional FFT by linearizing the data array. Figure A–7 shows the tradeoff between using the linearized two-dimensional routine (for small N) and the transposed method (for large N) to maintain high performance across all data sizes.
Glossary

accumulator: A register that accumulates data for arithmetic or logic operations.

ALU: Arithmetic Logic Unit; a subset of the operation instruction function unit that performs arithmetic and logical operations, usually in binary form.

Amdahl’s Law: A mathematical equation that states that a system is dominated by its slowest process.

arithmetic exception: A software error that occurs while performing an arithmetic or floating-point operation.

array: Elements or data arranged in rows and columns.
control word operand: The portion of the instruction that indicates which registers to use and enables or disables certain functions.

crossover point: The vector length at which the vector processor’s performance exceeds that of the scalar processor.

data dependency: The case in which data for one arithmetic instruction depends on the result of a previous instruction, so both instructions cannot execute at the same time.
latency: The time that elapses from an element entering a pipeline until the first result is produced.

linear recurrence: A cyclic data dependency in a linear function.

LINPACK: An industry-wide benchmark used to measure the performance characteristics of various computers. Unlike some benchmarks, LINPACK is a real application package doing linear algebra calculations.
MFLOPS: See megaflops.

MIPS: A measure of the performance of a computer; the rate at which the computer executes instructions, expressed in terms of "millions of instructions per second."

MTF: Match true/false. When masked operations are enabled, only elements for which the Vector Mask Register bit matches true (or false, depending on the condition) are operated upon.
scalar operand: The symbolic expression representing the scalar data accessed when an operation is executed; for example, the input data or argument.

scatter: The process of storing data into memory starting with a base address plus an offset address; the instruction that performs this action of placing the data back into memory.

second-order linear recurrence: Two cyclic data dependencies occurring in a linear function.
vector instruction: A native computer instruction that recognizes a vector as a native data structure and that can operate on all the elements of a vector concurrently.

vector length register (VLR): A 7-bit register that controls the number of elements used in the vector registers, from 0 to 64.

vector load: To move data from memory to the vector registers. For example, the VLDx command moves data to the vector registers.
vector-vector operation: An operation in which each element of one vector register operates with the corresponding element of a second vector register and then places the results in matching elements of a third vector register.
Index

A
Amdahl’s Law • 1–19, A–3
Arithmetic
  data path chip • 2–4
  instructions • 2–5
  unit • 2–5 to 2–18, 2–19, 2–21
Arithmetic pipeline • 3–11
Array • 1–2, 1–6, 1–8, 1–13
Array index • 1–6
Attached vector processor • 1–6, 1–7

B
Basic linear algebra subroutines (BLAS) • A–2
BLAS 2 routines • A–4
BLAS 3 routine • A–5
BLAS level 1 • A–2
Bus master • 2–5

C
Cache • 1–7, 2–5, 2–7, 2–8, 2–12 to 2–16, 2–18, 3–5, A–4
Cache miss • 3–18, 3–21, 3–22
Chaining • 1–8, 1–18, 2–5
Chain into store • 2–18, 3–16
Chime • 1–13
R
Return from Exception or Interrupt (REI) instruction • 3–7

S
Scalar/vector memory synchronization • 3–7 to 3–9
Scalar/vector synchronization • 3–6
Scatter • 1–15, 1–17, 2–19
Scatter instruction • 3–13, 3–14, A–7
Scoreboarding • 2–5, 2–18
Sectioning • 1–14
SIMD • 1–3
Single-precision • 2–6
Speedup ratio • 1–22
Store operation • 3–13
Stride • 1–15, 3–13
Stripmining • 1–14
Subvector • 1–14
Synchronization • 3–6
Synchronize Vector Memory Access (VSYNC) instruction • 3–9
SYNC instruction • 3–6

T
Transla