AMD Athlon™ Processor x86 Code Optimization Guide
© 1999 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999

Contents

Revision History . . . xv
1 Introduction . . . 1
About this Document . . . 1
AMD Athlon™ Processor Family . . . 3
AMD Athlon Processor Microarchitecture Summary

Switch Statement Usage . . . 21
Optimize Switch Statements . . . 21
Use Prototypes for All Functions . . . 21
Use Const Type Qualifier . . . 22
Generic Loop Hoisting

Use 8-Bit Sign-Extended Displacements . . . 39
Code Padding Using Neutral Code Fillers . . . 39
Recommendations for the AMD Athlon Processor . . . 40
Recommendations for AMD-K6® Family and AMD Athlon Processor Blended Code . . . 41
5 Cache and Memory Optimizations . . . 45
Memory Size and Alignment Issues

7 Scheduling Optimizations . . . 67
Schedule Instructions According to their Latency . . . 67
Unrolling Loops . . . 67
Complete Loop Unrolling . . . 67
Partial Loop Unrolling . . . 68
Use Function Inlining

Signed Derivation for Algorithm, Multiplier, and Shift Factor . . . 95
9 Floating-Point Optimizations . . . 97
Ensure All FPU Data is Aligned . . . 97
Use Multiplies Rather than Divides . . . 97
Use FFREEP Macro to Pop One Register from the FPU Stack

Fast Conversion of Signed Words to Floating-Point . . . 113
Use MMX PXOR to Negate 3DNow! Data . . . 113
Use MMX PCMP Instead of 3DNow! PFCMP . . . 114
Use MMX Instructions for Block Copies and Block Fills . . . 115
Use MMX PXOR to Clear All Bits in an MMX Register . . . 118
Use MMX PCMPEQD to Set All Bits in an MMX Register

Integer Execution Unit
Floating-Point Scheduler
Floating-Point Execution Unit
Load-Store Unit (LSU)
L2 Cache Controller
Write Combining

PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h) . . . 167
Starting and Stopping the Performance-Monitoring Counters . . . 168
Event and Time-Stamp Monitoring Software . . . 168
Monitoring Counter Overflow
List of Figures

Figure 1. AMD Athlon™ Processor Block Diagram . . . 131
Figure 2. Integer Execution Pipeline . . . 135
Figure 3. Floating-Point Unit Block Diagram . . . 137
Figure 4. Load/Store Unit . . . 138
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware . . . 142
Figure 6.
List of Tables

Table 1. Latency of Repeated String Instructions . . . 84
Table 2. Integer Pipeline Operation Types
Table 3. ... Table 28.
Table 29. VectorPath Integer Instructions
Table 30. VectorPath MMX Instructions
Table 31. VectorPath MMX Extensions
Table 32. VectorPath Floating-Point Instructions
Revision History

Date / Rev / Description

■ Added “About this Document” on page 1.
■ Further clarification of “Consider the Sign of Integer Operands” on page 14.
■ Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
■ Added the optimization, “Accelerating Floating-Point Divides and Square Roots” on page 29.
■ Clarified examples in “Copy Frequently De-referenced Pointer Arguments to Local Variables” on page 31.
1 Introduction

The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors.
previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:

Chapter 1: Introduction. Outlines the material covered in this document. Summarizes the AMD Athlon microarchitecture.

Chapter 2: Top Optimizations. Provides convenient descriptions of the most important optimizations a programmer should take into consideration.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.

Appendix C: Implementation of Write Combining. Describes the algorithm used by the AMD Athlon processor to write combine.

Appendix D: Performance Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.
AMD Athlon™ Processor Microarchitecture Summary

The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software.
AMD Athlon execution core to achieve and sustain maximum performance. As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.
The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance.
2 Top Optimizations

This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. The optimizations in this chapter are divided into two groups and listed in order of importance.
■ Avoid Placing Code and Data in the Same 64-Byte Cache Line

Optimization Star

✩ TOP The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations: Essential Optimizations

Memory Size and Alignment Issues

See “Memory Size and Alignment Issues” on page 45 for more details.
anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:

Prefetch Length = 200 (DS/C)

■ Round up to the nearest cache line.
■ DS is the data stride per loop iteration.
■ C is the number of cycles per loop iteration when hitting in the L1 cache.

See “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46 for more details.
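The prefetch-distance formula can be evaluated with a few lines of integer arithmetic. The following is a minimal sketch, not code from this guide: the helper name prefetch_length and its parameter names are hypothetical, and the 64-byte line size is taken from the cache-line size quoted elsewhere in this chapter.

```c
#include <stddef.h>

/* Cache-line size in bytes (64 bytes, as quoted in this chapter). */
#define CACHE_LINE_BYTES 64u

/*
 * Hypothetical helper (an illustration, not part of the guide).
 * Evaluates Prefetch Length = 200 * (DS / C), rounded up to a whole
 * number of cache lines.
 *   ds_bytes        data stride per loop iteration, in bytes
 *   cycles_per_iter cycles per loop iteration when hitting in the L1
 *                   cache (must be nonzero)
 */
static size_t prefetch_length(size_t ds_bytes, size_t cycles_per_iter)
{
    /* ceil(200 * DS / C) using integer arithmetic */
    size_t raw = (200u * ds_bytes + cycles_per_iter - 1u) / cycles_per_iter;

    /* round up to the next multiple of the cache-line size */
    return (raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For instance, a loop that consumes 64 bytes per iteration in 100 cycles gives 200 × 64 / 100 = 128 bytes, which is already two full cache lines.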
Avoid Load-Execute Floating-Point Instructions with Integer Operands

✩ TOP Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle.
Avoid Placing Code and Data in the Same 64-Byte Cache Line

✩ TOP Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified. In order to maintain cache coherency, the AMD Athlon processor may thrash its caches, resulting in lower performance.
Group II Optimizations: Secondary Optimizations
3 C Source Level Optimizations

This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance.

Ensure Floating-Point Variables and Expressions are of Type Float

For compilers that generate 3DNow!™ instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants.
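Why constants deserve attention can be seen from the C typing rules: an unsuffixed floating-point constant has type double, so it silently promotes a float expression. The sketch below is an illustration with hypothetical function names (scale_avoid, scale_preferred), not code from this guide.

```c
/* Unsuffixed floating-point constants have type double in C, so an
   expression mixing them with float operands has type double. */
static float scale_avoid(float x)
{
    /* 0.5 is a double: x is converted to double, the product is a
       double, and the result is converted back to float on return. */
    return x * 0.5;
}

static float scale_preferred(float x)
{
    /* 0.5f is a float: the whole expression has type float. */
    return x * 0.5f;
}
```

Both functions return the same value here; the difference is only in the type of the intermediate expression, which is what a compiler generating single-precision code has to honor.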
Consider the Sign of Integer Operands

In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required so an unsigned type is appropriate. However, recording temperatures in degrees Celsius may require both positive and negative numbers so a signed type is needed.
Example (Avoid):

int i;          ====>    MOV EAX, i
i = i / 4;               CDQ
                         AND EDX, 3
                         ADD EAX, EDX
                         SAR EAX, 2
                         MOV i, EAX

Example (Preferred):

unsigned int i; ====>    SHR i, 2
i = i / 4;

In summary:

Use unsigned types for:
■ Division and remainders
■ Loop counters
■ Array indexing

Use signed types for:
■ Integer-to-float conversion

Use Array Style Instead of Pointer Style Code

The use of pointers in C makes work difficult for the optimizers in C co
Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving performance and compiler optimizations "fight" each other.
    *res++ = dp;          /* write transformed z */

    dp  = vv->x * *m++;
    dp += vv->y * *m++;
    dp += vv->z * *m++;
    dp += vv->w * *m++;

    *res++ = dp;          /* write transformed w */

    ++vv;                 /* next input vertex */
    m -= 16;              /* reset to start of transform matrix */
  }
}

Example 2 (Preferred):

typedef struct {
  float x,y,z,w;
} VERTEX;

typedef struct {
  float m[4][4];
} MATRIX;

void XForm (float *res, const float *v, const float *m, int numverts)
{
  int i;
  const
Completely Unroll Small Loops

Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops.
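As an illustration of complete unrolling at the source level, the sketch below adds two three-element vectors, first with a loop and then fully unrolled. The function names add3_loop and add3_unrolled are hypothetical and chosen only for this example.

```c
/* Loop form: three iterations, each paying for the counter update,
   the compare, and the branch. */
static void add3_loop(float r[3], const float a[3], const float b[3])
{
    int i;
    for (i = 0; i < 3; i++) {
        r[i] = a[i] + b[i];
    }
}

/* Completely unrolled form: the same three additions with no loop
   overhead at all. */
static void add3_unrolled(float r[3], const float a[3], const float b[3])
{
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] + b[2];
}
```

With a body this small, the loop control is a large share of the work, which is the case where unrolling by hand is most likely to pay off if the compiler does not do it.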
code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the programmer remove the dependency manually, e.g., by introducing a temporary variable that can be kept in a register. This can result in a significant performance increase.
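The temporary-variable technique can be sketched on a running sum, where each iteration needs the value the previous iteration just stored. The function names below are hypothetical, and the code is an illustration rather than the guide's own listing.

```c
/* Avoid: x[k-1] is loaded right after the previous iteration stored
   it, so every iteration has a store-to-load dependency. */
static void running_sum_avoid(double *x, const double *y, int n)
{
    int k;
    for (k = 1; k < n; k++) {
        x[k] = x[k - 1] + y[k];
    }
}

/* Preferred: the running value is carried in the temporary t, which
   the compiler can keep in a register; memory is only written. */
static void running_sum_preferred(double *x, const double *y, int n)
{
    int k;
    double t;

    if (n < 1) {
        return;
    }
    t = x[0];
    for (k = 1; k < n; k++) {
        t = t + y[k];
        x[k] = t;
    }
}
```

Both versions produce the same array contents; only the second makes it explicit that the value does not need to be read back from memory.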
Consider Expression Order in Compound Branch Conditions

Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && and ||. C guarantees a short-circuit evaluation of these operators. This means that in the case of ||, the first operand to evaluate to TRUE terminates the evaluation, i.e., following operands are not evaluated at all.
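Short-circuit evaluation can be observed directly by counting how often each operand is evaluated. The sketch below uses hypothetical predicates and counters introduced only for this demonstration.

```c
/* Evaluation counters, used only to make short-circuiting visible. */
static int calls_first;
static int calls_second;

/* Hypothetical predicates for the demonstration. */
static int is_non_negative(int v) { calls_first++;  return v >= 0; }
static int is_magic_value(int v)  { calls_second++; return v == -12345; }

/* With ||, the second operand is evaluated only when the first one
   is FALSE. */
static int accept_value(int v)
{
    return is_non_negative(v) || is_magic_value(v);
}
```

Calling accept_value(5) evaluates only the first predicate, while accept_value(-1) has to evaluate both, so the order of the operands decides how much work the common case performs.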
Switch Statement Usage

Optimize Switch Statements

Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrences, with the most probable first. This will improve performance when the switch is translated as a comparison chain.
Use Const Type Qualifier

Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available to the compiler. For example, the C standard allows compilers to not allocate storage for objects that are declared “const”, if their address is never taken.
Generalization for Multiple Constant Control Code

To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this to a simple switch statement.

Example 2:

for(i ...
  case combine( 1, 1 ):
    for( i ... ) {
      DoWork1( i );
      DoWork3( i );
    }
    break;
  default:
    break;
}

The trick here is that there is some up-front work involved in generating all the combinations for the switch constant and the total amount of code has doubled. However, it is also clear that the inner loops are "if()-free".
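A complete, compilable instance of this hoisting pattern might look as follows. It is a sketch under assumptions: the COMBINE macro, the two flags, and the per-element work are hypothetical stand-ins and do not reproduce the guide's DoWork functions.

```c
/* Hypothetical macro: packs two 0/1 flags into one switch constant. */
#define COMBINE(c1, c2) (((c1) << 1) | (c2))

/* Reference form: both loop-invariant flags are tested on every
   iteration. */
static void fill_with_ifs(int *out, int n, int doubled, int plus_one)
{
    int i;
    for (i = 0; i < n; i++) {
        int v = i;
        if (doubled)  { v *= 2; }
        if (plus_one) { v += 1; }
        out[i] = v;
    }
}

/* Hoisted form: the flags are tested once, and each case holds a
   loop body without any if(). */
static void fill_hoisted(int *out, int n, int doubled, int plus_one)
{
    int i;
    switch (COMBINE(doubled != 0, plus_one != 0)) {
    case COMBINE(0, 0):
        for (i = 0; i < n; i++) { out[i] = i; }
        break;
    case COMBINE(1, 0):
        for (i = 0; i < n; i++) { out[i] = i * 2; }
        break;
    case COMBINE(0, 1):
        for (i = 0; i < n; i++) { out[i] = i + 1; }
        break;
    case COMBINE(1, 1):
        for (i = 0; i < n; i++) { out[i] = i * 2 + 1; }
        break;
    default:
        break;
    }
}
```

The hoisted version contains four copies of the loop, which is the code-size cost described above, in exchange for inner loops that carry no flag tests.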
which might inhibit certain optimizations with some compilers, for example, aggressive inlining.

Dynamic Memory Allocation Consideration

Dynamic memory allocation (‘malloc’ in C language) should always return a pointer that is suitably aligned for the largest base type (quadword alignment).
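Where an allocator cannot be relied on to return quadword-aligned memory, one common workaround is to over-allocate and round the pointer up. The sketch below assumes a hypothetical helper name, alloc_doubles_aligned8, and uses uintptr_t from the C99 header stdint.h.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (an illustration, not part of the guide).
 * Allocates room for `count` doubles plus 7 spare bytes, then rounds
 * the address up to the next multiple of 8.
 *   raw_out  receives the pointer that must later be passed to free()
 * Returns the 8-byte-aligned pointer, or NULL if malloc fails.
 */
static double *alloc_doubles_aligned8(size_t count, void **raw_out)
{
    void *raw = malloc(count * sizeof(double) + 7u);

    *raw_out = raw;
    if (raw == NULL) {
        return NULL;
    }
    /* add 7, then clear the low three address bits */
    return (double *)(((uintptr_t)raw + 7u) & ~(uintptr_t)7u);
}
```

The caller works through the aligned pointer but releases the block with the raw pointer, since only the raw pointer is valid for free().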
lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.

Example 1 (Avoid):

double a[100],sum;
int i;
sum = 0.0f;
for (i=0; i<100; i++) {
  sum += a[i];
}

Example 2 (Preferred):

double a[100],sum1,sum2,sum3,sum4,sum;
int i;
sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.
Example 1

Avoid:

double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;

Preferred:

double a,b,c,d,e,f,t;
t = b/d;
e = c*t;
f = a*t;

Example 2

Avoid:

double a,b,c,e,f;
e = a/c;
f = b/c;

Preferred:

double a,b,c,e,f,t;
t = 1/c;
e = a*t;
f = b*t;

C Language Structure Component Considerations

Many compilers have options that allow padding of structures to make their size multiples of words, doublewords, or quadwords, in order to achieve better ali
Pad by Multiple of Largest Base Type Size

Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other members are naturally aligned as well. The padding of the structure to a multiple of the largest base type size allows, for example, arrays of structures to be perfectly aligned.
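The padding rule can be checked at the source level with sizeof and offsetof. The structure below is a hypothetical example that assumes an 8-byte double and a 4-byte int, which holds on common x86 compilers but is not guaranteed by the C standard.

```c
#include <stddef.h>

/* Hypothetical structure. Members are ordered from largest to
   smallest base type, and an explicit pad array rounds the size up
   to a multiple of sizeof(double). Assumes 8-byte double and 4-byte
   int: 8 + 4 + 5 + 7 = 24 bytes. */
struct sample_padded {
    double d;       /* offset 0,  largest base type */
    int    i;       /* offset 8  */
    char   a[5];    /* offset 12 */
    char   pad[7];  /* offset 17, brings the total size to 24 */
};
```

Because the size is a multiple of the largest member size, every element of an array of struct sample_padded starts on the same alignment as the first one.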
quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.
necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations. The Microsoft® Visual C environment provides functions to manipulate the FPU control word and thus the precision control.
Avoid Unnecessary Integer Division

Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One opportunity for reducing the number of integer divisions is a chain of multiple divisions, in which one division can be replaced with a multiplication as shown in the following examples. This replacement is possible only if no overflow occurs during the computation of the product.
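A minimal sketch of the replacement, with hypothetical function names, divides by two values in a row and then by their product instead. It relies on the identity (i / j) / k == i / (j * k) for unsigned integer division, which holds as long as j * k does not overflow.

```c
/* Avoid: two integer divisions. */
static unsigned div_chain_avoid(unsigned i, unsigned j, unsigned k)
{
    return i / j / k;
}

/* Preferred: one multiplication and one division. Valid only when
   j * k does not overflow; j and k must be nonzero in both forms. */
static unsigned div_chain_preferred(unsigned i, unsigned j, unsigned k)
{
    return i / (j * k);
}
```

For example, 100 / 3 / 7 and 100 / 21 both evaluate to 4.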
Example 1 (Avoid):

//assumes pointers are different and q!=r
void isqrt ( unsigned long a,
             unsigned long *q,
             unsigned long *r)
{
  *q = a;
  if (a > 0)
  {
    while (*q > (*r = a / *q))
    {
      *q = (*q + *r) >> 1;
    }
  }
  *r = a - *q * *q;
}

Example 2 (Preferred):

//assumes pointers are different and q!=r
void isqrt ( unsigned long a,
             unsigned long *q,
             unsigned long *r)
{
  unsigned long qq, rr;
  qq = a;
  if (a > 0)
  {
    while (qq > (rr = a / qq))
    {
      qq = (qq + rr
4 Instruction Decoding Optimizations

This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor. Guidelines are listed in order of importance.

Overview

The AMD Athlon processor instruction fetcher reads 16-byte aligned code windows from the instruction cache. The instruction bytes are then merged into a 24-byte instruction queue.
Select DirectPath Over VectorPath Instructions

✩ TOP Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction, which includes ‘register ← register op memory’ as well as ‘register ← register op register’ forms of instructions.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands

✩ TOP When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density.

Note: This optimization applies only to floating-point instructions with floating-point operands and not with integer operands, as described in the next optimization.
Example 1 (Avoid):

FLD   QWORD PTR [foo]
FIMUL DWORD PTR [bar]
FIADD DWORD PTR [baz]

Example 2 (Preferred):

FILD  DWORD PTR [bar]
FILD  DWORD PTR [baz]
FLD   QWORD PTR [foo]
FMULP ST(2), ST
FADDP ST(1), ST

Align Branch Targets in Program Hot Spots

In program hot spots (i.e., innermost loops in the absence of profiling data), place branch targets at or near the beginning of 16-byte aligned code windows.
Example 2 (Preferred):

05 78 56 34 12    add eax, 12345678h   ;uses single byte opcode form
83 C3 FB          add ebx, -5          ;uses 8-bit sign extended immediate
74 05             jz $label1           ;uses 1-byte opcode, 8-bit immediate

Avoid Partial Register Reads and Writes

In order to handle partial register writes, the AMD Athlon processor execution core implements a data-merging scheme.
Replace Certain SHLD Instructions with Alternative Code

Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA. The alternative code has lower latency and requires fewer execution resources. SHR and LEA (32-bit version) are DirectPath instructions, while SHLD is a VectorPath instruction. Using SHR and LEA preserves decode bandwidth, as it potentially enables the decoding of a third DirectPath instruction.
Use 8-Bit Sign-Extended Displacements

Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor.

Code Padding Using Neutral Code Fillers

Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out branches.
Recommendations for the AMD Athlon™ Processor

For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF3). In the AMD Athlon processor, a NOP with up to two REP prefixes can be handled by a single decoder with no overhead.
Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code

On x86 processors other than the AMD Athlon processor (including the AMD-K6 family of processors), the REP prefix and especially multiple prefixes cause decoding overhead, so the above technique is not recommended for code that has to run well both on the AMD Athlon processor and on other x86 processors (blended code).
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]
NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx]
NOP3_ESI TEXTEQU <DB 08Dh,034h,026h> ;lea esi, [esi]
NOP3_EDI TEXTEQU <DB 08Dh,03Ch,027h> ;lea edi, [edi]
NOP3_ESP TEXTEQU <DB 08Dh,024h,024h> ;lea esp, [esp]
NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h> ;lea ebp, [ebp]

NOP4_EAX TEXTEQU
NOP4_EBX TEXTEQU
NOP4_ECX TEXTEQU
NOP4_EDX TEXTEQU
NOP4_ESI TEXTEQU
NOP4_EDI TEXTEQU
NOP4_ESP TEXTEQ
;lea edi, [edi+00000000]
NOP6_EDI TEXTEQU

;lea ebp, [ebp+00000000]
NOP6_EBP TEXTEQU

;lea eax, [eax*1+00000000]
NOP7_EAX TEXTEQU

;lea ebx, [ebx*1+00000000]
NOP7_EBX TEXTEQU

;lea ecx, [ecx*1+00000000]
NOP7_ECX TEXTEQU

;lea edx, [edx*1+00000000]
NOP7_EDX TEXTEQU

;lea esi, [esi*1+00000000]
NOP
5 Cache and Memory Optimizations

This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor. Guidelines are listed in order of importance.

Memory Size and Alignment Issues

Avoid Memory Size Mismatches

✩ TOP Avoid memory size mismatches when instructions operate on the same data.
Align Data Where Possible

✩ TOP In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned. For example:

■ QWORD accesses are aligned if they access an address divisible by 8.
■ DWORD accesses are aligned if they access an address divisible by 4.
■ WORD accesses are aligned if they access an address divisible by 2.
PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent. To maintain compatibility with the 25 million AMD-K6®-2 and AMD-K6-III processors already sold, use the 3DNow! PREFETCH/W instructions instead of the various prefetch flavors in the new MMX extensions.
MOV ECX, (-LARGE_NUM)       ;used biased index
MOV EAX, OFFSET array_a     ;get address of array_a
MOV EDX, OFFSET array_b     ;get address of array_b
MOV ECX, OFFSET array_c     ;get address of array_c

$loop:
PREFETCHW [EAX+196]         ;two
PREFETCH  [EDX+196]         ;two
PREFETCH  [ECX+196]         ;two
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE]
FMUL QWORD PTR [ECX+ECX*8+ARR_SIZE]
FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE]
FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+8]
FMUL QWORD PTR [E
The following optimization rules were applied to this example.

■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the available number of outstanding PREFETCHes.

Determining Prefetch Distance
Take Advantage of Write Combining

✩ TOP Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggressive write-combining algorithm, which improves performance significantly. See Appendix C, “Implementation of Write Combining” on page 155 for more details.
Store-to-Load Forwarding Restrictions

Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store architecture when either a load operation is not allowed to read needed data from a store in the store buffer, or a load OP detects a false data dependency on a store in the store buffer.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction

If the following conditions are present, there is a narrow-to-wide store-buffer data forwarding restriction:

■ The operand size of the store data is smaller than the operand size of the load data.
■ The range of addresses spanned by the store data covers some sub-region of the range of addresses spanned by the load data.
Example 5 (Preferred):

MOVD      [foo], MM1      ;store lower half
PUNPCKHDQ MM1, MM1        ;get upper half into lower half
MOVD      [foo+4], MM1    ;store lower half
...
ADD       EAX, [foo]      ;fine
ADD       EDX, [foo+4]    ;fine

Misaligned Store-Buffer Data Forwarding Restriction

If the following condition is present, there is a misaligned store-buffer data forwarding restriction:

■ The store or load address is misaligned.
One Supported Store-to-Load Forwarding Case

There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWORD read is allowed.

Example 8 (Allowed):

MOVQ ...
Example (Preferred):

Prolog:
PUSH EBP
MOV  EBP, ESP
SUB  ESP, SIZE_OF_LOCALS   ;size of local variables
AND  ESP, -8
;push registers that need to be preserved

Epilog:
;pop register that needed to be preserved
MOV  ESP, EBP
POP  EBP
RET

With this technique, function arguments can be accessed via EBP, and local variables can be accessed via ESP.
Example:

struct {
  char a[5];
  long k;
  double x;
} baz;

The structure components should be allocated (lowest to highest address) as follows:

x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0

See “C Language Structure Component Considerations” on page 27 for more information from a C source code perspective.
6 Branch Optimizations

While the AMD Athlon™ processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit. This chapter discusses rules that improve branch prediction and minimize branch penalties. Guidelines are listed in order of importance.
AMD Athlon™ Processor Specific Code

Example 1: Signed integer ABS function (X = labs(X)):

MOV   ECX, [X]    ;load value
MOV   EBX, ECX    ;save value
NEG   ECX         ;-value
CMOVS ECX, EBX    ;if -value is negative, select value
MOV   [X], ECX    ;save labs result

Example 2: Unsigned integer min function (z = x < y ? x : y):

MOV    EAX, [X]   ;load X value
MOV    EBX, [Y]   ;load Y value
CMP    EAX, EBX   ;EBX<=EAX ? CF=0 : CF=1
CMOVNC EAX, EBX   ;EAX=(EBX<=EA
MOV    [Z], EAX
Example 6: Increment Ring Buffer Offset:

//C Code
char buf[BUFSIZE];
int a;

if (a < (BUFSIZE-1)) {
  a++;
} else {
  a = 0;
}

;-------------
;Assembly Code
MOV EAX, [a]            ; old offset
CMP EAX, (BUFSIZE-1)    ; a < (BUFSIZE-1) ? CF : NC
INC EAX                 ; a++
SBB EDX, EDX            ; a < (BUFSIZE-1) ? 0xffffffff : 0
AND EAX, EDX            ; a < (BUFSIZE-1) ? a++ : 0
MOV [a], EAX            ; store new offset

Example 7: Integer Signum Function:

//C Code
int a, s;

if (!a) {
  s =
} else if
Replace Branches with Computation in 3DNow!™ Code

Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar and inhibit the SIMD processing that makes 3DNow! code superior.
Example 2 (Preferred):
; r = (x < y) ? b : a
;
; in: mm0 a
;     mm1 b
;     mm2 x
;     mm3 y
; out: mm1 r
PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
PAND MM1, MM3 ; y > x ? b : 0
PANDN MM3, MM0 ; y > x ? 0 : a
POR MM1, MM3 ; r = y > x ? b : a
Sample Code Translated into 3DNow!™ Code
The following examples use scalar code translated into 3DNow! code.
Example 2:
C code:
float x,z;
z = fabs(x);
if (z >= 1) {
  z = 1/z;
}
3DNow! code:
;in: MM0 = x
;out: MM0 = z
MOVQ MM5, mabs ;0x7fffffff
PAND MM0, MM5 ;z=abs(x)
PFRCP MM2, MM0 ;1/z approx
MOVQ MM1, MM0 ;save z
PFRCPIT1 MM0, MM2 ;1/z step
PFRCPIT2 MM0, MM2 ;1/z final
PFMIN MM0, MM1 ;z = z < 1 ? z : 1/z
Example 3:
C code:
float x,z,r,res;
z = fabs(x);
if (z < 0.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 4: C code: #define PI 3.14159265358979323 float x,z,r,res; /* 0 <= r <= PI/4 */ z = abs(x) if (z < 1) { res = r; } else { res = PI/2-r; } 3DNow! code: ;in: MM0 = x ; MM1 = r ;out: MM1 = res MOVQ MM5, mabs MOVQ MM6, one PAND MM0, MM5 PCMPGTD MM6, MM0 MOVQ MM4, pio2 PFSUB MM4, MM1 PANDN MM6, MM4 PFMAX MM1, MM6 Replace Branches with Computation in 3DNow!™ Code ; ; ; ; ; ; ; ; mask to clear sign bit 1.
AMD Athlon™ Processor x86 Code Optimization Example 5: 22007E/0—November 1999 C code: #define PI 3.
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:
Example 1 (Avoid):
LOOP LABEL
Example 2 (Preferred):
DEC ECX
JNZ LABEL
Avoid Far Control Transfer Instructions
Avoid using far control transfer instructions. Far control transfer branches cannot be predicted by the branch target buffer (BTB).
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is one in which the call to itself is the last operation the function performs.
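A minimal C sketch of this conversion follows. The greatest-common-divisor routine and both function names are chosen only for illustration and do not come from this guide. In the end-recursive form the self-call is the final action, so it can be replaced by updating the parameters and looping.

```c
/* End-recursive form: the recursive call is the last operation. */
unsigned gcd_recursive(unsigned a, unsigned b)
{
    if (b == 0)
        return a;
    return gcd_recursive(b, a % b);
}

/* Iterative equivalent: the self-call becomes a parameter update
   inside a loop, so no return addresses are pushed per step. */
unsigned gcd_iterative(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```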
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 7 Scheduling Optimizations This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance. Schedule Instructions According to their Latency The AMD Athlon™ processor can execute up to three x86 instructions per cycle, with each x86 instruction possibly having a different latency.
unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities. Only the unrolling of very large loops can result in inefficient use of the L1 instruction cache.
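As an illustration only, the following C sketch shows a loop with a small, constant trip count next to its completely unrolled form. The four-element dot product and the function names are assumptions made for this sketch, not an example from this guide.

```c
/* Rolled form: the counter i occupies a register and every
   iteration ends in a conditional branch. */
float dot4_rolled(const float a[4], const float b[4])
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < 4; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Completely unrolled form: loop control is removed and the body is
   replicated four times, leaving four multiplies to schedule freely. */
float dot4_unrolled(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```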
Without Loop Unrolling:
MOV ECX, MAX_LENGTH
MOV EAX, OFFSET A
MOV EBX, OFFSET B
$add_loop:
FLD QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD EAX, 8
ADD EBX, 8
DEC ECX
JNZ $add_loop
The loop consists of seven instructions. The AMD Athlon processor can decode/retire three instructions per cycle, so it cannot execute faster than three iterations in seven cycles, or 3/7 floating-point adds per cycle.
no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop.
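One way to picture the loop control of a partially unrolled counting loop is the C sketch below. It is a hedged illustration under stated assumptions, not the derivation given in this guide: the function name add_arrays_unrolled2 and the unroll factor of two are choices made for the example. The main loop handles two elements per iteration and a cleanup step handles the single leftover element when the count is odd.

```c
#include <stddef.h>

/* a[i] += b[i] for i in [0, n), unrolled by a factor of two. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main loop: runs while at least two elements remain. */
    for (; i + 1 < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }

    /* Cleanup: at most one element is left when n is odd. */
    if (i < n)
        a[i] += b[i];
}
```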
Use Function Inlining
Overview
Make use of the AMD Athlon processor’s large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, which can increase load/store instructions for register spilling.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Always Inline Functions if Called from One Site A function should always be inlined if it can be established that it is called from just one site in the code. For the C language, determination of this characteristic is made easier if functions are explicitly declared static unless they require external linkage.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): ADD MOV MOV MOV EBX, EAX, ECX, EDX, ECX DWORD PTR [10h] DWORD PTR [EAX+EBX] DWORD PTR [24h] ;inst 1 ;inst 2 (fast address calc.) ;inst 3 (slow address calc.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i; for (i=0; i < MAXSIZE; i++) { c [i] = a[i] + b[i]; } MOV XOR XOR XOR ECX, ESI, EDI, EBX, MAXSIZE ESI EDI EBX $add_loop: MOV EAX, [ESI + a] MOV EDX, [EDI + b] ADD EAX, EDX MOV [EBX + c], EAX ADD ESI, 4 ADD EDI, 4 ADD EBX, 4 DEC ECX JNZ $add_loop ;initialize ;initialize ;initialize ;initialize loop counter offset into array a offset into array b offset into array c ;get elem
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as arguments of a function) biasing the base addresses requires additional instructions to perform the biasing at run time and a small amount of additional overhead is incurred.
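The run-time biasing described above can be sketched in C as follows. This is an illustrative assumption about how the technique looks when the base addresses arrive as function arguments; the function name add_arrays_neg_index is hypothetical. Each base pointer is biased to one past the last element, and a single index counts up from -n to zero, serving as both array offset and loop counter.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i] for i in [0, n), using one index variable
   that starts negative and reaches zero when the loop expires. */
void add_arrays_neg_index(const int *a, const int *b, int *c, ptrdiff_t n)
{
    const int *a_end = a + n;   /* biased base addresses */
    const int *b_end = b + n;
    int *c_end = c + n;
    ptrdiff_t i;

    for (i = -n; i != 0; i++)
        c_end[i] = a_end[i] + b_end[i];
}
```

The three pointer additions before the loop are the small additional overhead that the text mentions for the case where the bases are not compile-time constants.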
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 8 Integer Optimizations This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance. Replace Divides with Multiplies Replace integer division by constants with multiplication by the reciprocal.
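As a small worked instance, the C sketch below divides an unsigned 32-bit value by the constant 10 using a multiply and a shift. The multiplier 0xCCCCCCCD (the ceiling of 2^35 / 10) and the shift count 35 are the widely used pair for this divisor; they are supplied here as an assumption for illustration and are not taken from this guide. The function name div10 is hypothetical.

```c
#include <stdint.h>

/* x / 10 for any 32-bit unsigned x, computed as the upper bits of a
   64-bit product with a scaled reciprocal of the divisor. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```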
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Signed Division Utility In the opt_utilities directory of the AMD documentation CDROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor. Type “sdiv > example.out” to output the code to a file. Unsigned Division Utility In the opt_utilities directory of the AMD documentation CDROM, run udiv.
Example 1:
;In: EAX = dividend
;Out: EDX = quotient
XOR EDX, EDX ;0
CMP EAX, d ;CF = (dividend < divisor) ? 1 : 0
SBB EDX, -1 ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
In cases where the dividend does not need to be preserved, the division can be accomplished without the use of an additional register, thus reducing register pressure.
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX ;quotient in EDX
Derivation for a, m, s
The derivation for the algorithm (a), multiplier (m), and shift count (s), is found in the section “Signed Derivation for Algorithm, Multiplier, and Shift Factor” on page 95.
Remainder of Signed Integer 2^n or –(2^n)
;IN:EAX = dividend
;OUT:EAX = remainder
CDQ ;Sign extend into EDX
AND EDX, (2^n–1) ;Mask correction (abs(divisor)–1)
ADD EAX, EDX ;Apply pre-correction
AND EAX, (2^n–1) ;Mask out remainder (abs(divisor)–1)
SUB EAX, EDX ;Apply pre-correction, if necessary
MOV [remainder], EAX
Use Alternative Code When Multiplying by a Constant
A 32-bit integer multiply by a constant has a latency of five cycles.
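The kind of substitution meant here can be sketched in C with shifts, adds, and subtracts standing in for the multiply. The two functions below are illustrative assumptions (the names mul15 and mul17 are hypothetical); unsigned arithmetic is used so that wraparound matches a 32-bit multiply exactly.

```c
#include <stdint.h>

/* x * 15 computed as (x * 16) - x: one shift and one subtract. */
uint32_t mul15(uint32_t x)
{
    return (x << 4) - x;
}

/* x * 17 computed as (x * 16) + x: one shift and one add. */
uint32_t mul17(uint32_t x)
{
    return (x << 4) + x;
}
```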
AMD Athlon™ Processor x86 Code Optimization by 11: LEA ADD ADD REG2, [REG1*8+REG1] REG1, REG1 REG1, REG2 by 12: SHL LEA REG1, 2 REG1, [REG1*2+REG1] LEA SHL SUB LEA LEA ADD REG2, REG1, REG1, REG2, REG1, REG1, by 15: MOV SHL SUB REG2, REG1 REG1, 4 REG1, REG2 ;2 cycles by 16: SHL REG1, 4 ;1 cycle by 17: MOV SHL ADD REG2, REG1 REG1, 4 REG1, REG2 ;2 cycles by 18: ADD LEA REG1, REG1 REG1, [REG1*8+REG1] ;3 cycles by 19: LEA SHL ADD REG2, [REG1*2+REG1] REG1, 4 REG1, REG2 ;3 cycles by 20
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 by 26: use IMUL by 27: LEA SHL SUB REG2, [REG1*4+REG1] REG1, 5 REG1, REG2 ;3 cycles by 28: MOV SHL SUB SHL REG2, REG1, REG1, REG1, ;3 cycles by 29: LEA SHL SUB REG2, [REG1*2+REG1] REG1, 5 REG1, REG2 ;3 cycles by 30: MOV SHL SUB ADD REG2, REG1, REG1, REG1, REG1 4 REG2 REG1 ;3 cycles by 31: MOV SHL SUB REG2, REG1 REG1, 5 REG1, REG2 ;2 cycles by 32: SHL REG1, 5 ;1 cycle REG1 3 REG2 2 Use MMX™ Instructions for Integ
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle. Repeated String Instruction Usage Latency of Repeated String Instructions Table 1 shows the latency for repeated string instructions on the AMD Athlon processor. Table 1.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Ensure DF=0 (UP) Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, source and destination overlap). While string instructions with DF = 1 (DOWN) are slower, only the overhead part of the cycle equation is larger and not the throughput part.
Use XOR Instruction to Clear Integer Registers
To clear an integer register to all 0s, use “XOR reg, reg”. The AMD Athlon processor is able to avoid the false read dependency on the XOR instruction.
Example 1 (Acceptable):
MOV REG, 0
Example 2 (Preferred):
XOR REG, REG
Efficient 64-Bit Integer Arithmetic
This section contains a collection of code snippets and subroutines showing the efficient implementation of 64-bit arithmetic.
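To make the underlying idea concrete, the C sketch below adds two 64-bit values held as pairs of 32-bit words, propagating the carry from the low word to the high word. It is an illustration only: the u64_pair type and the add64 function are hypothetical names, and the carry test stands in for what an ADD followed by ADC expresses in x86 assembly.

```c
#include <stdint.h>

/* A 64-bit value split into two 32-bit halves. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* 64-bit addition built from 32-bit operations. */
u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;                    /* low words, may wrap  */
    r.hi = a.hi + b.hi + (r.lo < a.lo);    /* add carry out of low */
    return r;
}
```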
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (count
; applied modulo 64)
SHLD EDX, EAX, CL ;first apply shift count
SHL EAX, CL ; mod 32 to EDX:EAX
TEST ECX, 32 ;need to shift by another 32?
JZ $lshift_done ;no, done
MOV EDX, EAX ;left shift EDX:EAX
XOR EAX, EAX ; by 32 bits
$lshift_done:
Example 5 (Right shift): SHRD SHR TEST JZ MOV XOR EAX, EDX, CL EDX, CL ECX, 32 $rshift_done EAX, EDX EDX, EDX ;first a
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 7 (Division): ;_ulldiv divides two unsigned 64-bit integers, and returns ; the quotient.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 MOV IMUL MUL ADD SUB MOV MOV SBB SBB XOR POP POP RET ECX, EAX EDI, EAX ;save quotient ;quotient * divisor hi-word ; (low only) DWORD PTR [ESP+20];quotient * divisor lo-word EDX, EDI ;EDX:EAX = quotient * divisor EBX, EAX ;dividend_lo – (quot.*divisor)_lo EAX, ECX ;get quotient ECX, [ESP+16] ;dividend_hi ECX, EDX ;subtract divisor * quot.
AMD Athlon™ Processor x86 Code Optimization $r_two_divs: MOV ECX, EAX MOV EAX, EDX XOR EDX, EDX DIV EBX MOV DIV MOV XOR POP RET EAX, ECX EBX EAX, EDX EDX, EDX EBX 22007E/0—November 1999 ;save dividend_lo in ECX ;get dividend_hi ;zero extend it into EDX:EAX ;EAX = quotient_hi, EDX = intermediate ; remainder ;EAX = dividend_lo ;EAX = quotient_lo ;EAX = remainder_lo ;EDX = remainder_hi = 0 ;restore EBX as per calling convention ;done, return to caller $r_big_divisor: PUSH EDI ;save EDI as per calling conv
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Efficient Implementation of Population Count Function Population count is an operation that determines the number of set bits in a bit string. For example, this can be used to determine the cardinality of a set. The following example code shows how to efficiently implement a population count operation for 32-bit operands. The example is written for the inline assembler of Microsoft Visual C.
Step 3
For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired sum and whose upper half contains "junk" that has to be masked out.
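The field-by-field summation can be sketched in portable C as shown below. This is a minimal illustration rather than the guide's inline-assembly routine; the function name popcount32 is hypothetical, and the first-step formula is the commonly used form, assumed here. The intermediate names w, x, and y follow the comments in the assembly listing.

```c
#include <stdint.h>

/* Population count of a 32-bit value by summing adjacent bit fields. */
uint32_t popcount32(uint32_t v)
{
    /* 2-bit fields, each holding the count of its two bits */
    uint32_t w = v - ((v >> 1) & 0x55555555u);
    /* 4-bit fields, each holding the sum of two 2-bit fields */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    /* 8-bit fields; the upper half of each is masked out */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* the multiply accumulates all four bytes into the top byte */
    return (y * 0x01010101u) >> 24;
}
```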
ADD EAX, EDX ;x = (w & 0x33333333) + ((w >> 2) &
 ; 0x33333333)
MOV EDX, EAX ;x
SHR EAX, 4 ;x >> 4
ADD EAX, EDX ;x + (x >> 4)
AND EAX, 00F0F0F0Fh ;y = (x + (x >> 4)) & 0x0F0F0F0F
IMUL EAX, 001010101h ;y * 0x01010101
SHR EAX, 24 ;population count = (y *
 ; 0x01010101) >> 24
MOV retVal, EAX ;store result
}
return (retVal);
}
Derivation of Multiplier Used for Integer Division by Constants
Unsigned Derivation for Algorithm, Multiplier, and
;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
ADC EDX, 0
SHR EDX, s ;EDX=quotient
*/
typedef unsigned __int64 U64;
typedef unsigned long U32;
U32 d, l, s, m, a, r;
U64 m_low, m_high, j, k;
U32 log2 (U32 i)
{
  U32 t = 0;
  i = i >> 1;
  while (i) {
    i = i >> 1;
    t++;
  }
  return (t);
}
/* Generate m, s for algorithm 0. Based on: Granlund, T.; Montgomery, P.L.:"Division by Invariant Integers using Multiplication”.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 /* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: “Integer Multiplication and Division on the HP Precision Architecture”. IEEE Transactions on Computers, Vol 37, No. 8, August 1988, page 980.
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX ; quotient in EDX
*/
typedef unsigned __int64 U64;
typedef unsigned long U32;
U32 log2 (U32 i)
{
  U32 t = 0;
  i = i >> 1;
  while (i) {
    i = i >> 1;
    t++;
  }
  return (t);
}
U32 d, l, s, m, a;
U64 m_low, m_high, j, k;
/* Determine algorithm (a), multiplier (m), and shift count (s) for 32-bit signed integer division.
9 Floating-Point Optimizations
This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU). Guidelines are listed in order of importance.
Ensure All FPU Data is Aligned
As discussed in “Memory Size and Alignment Issues” on page 45, floating-point data should be naturally aligned.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use FFREEP Macro to Pop One Register from the FPU Stack In FPU intensive code, frequently accessed data is often pre-loaded at the bottom of the FPU stack before processing floating-point data. After completion of processing, it is desirable to remove the pre-loaded data from the FPU stack as quickly as possible.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor. When FSTSW cannot be avoided (for example, backward compatibility of code with older processors), no FPU instruction should occur between an FCOM[P], FICOM[P], FUCOM[P], or FTST and a dependent FSTSW.
Minimize Floating-Point-to-Integer Conversions
C++, C, and Fortran define floating-point-to-integer conversions as truncating. This creates a problem because the active rounding mode in an application is typically round-to-nearest-even.
FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it. For applications in which the speed of floating-point-to-integer conversions is extremely critical for application performance, experiment with either of the following substitutions, which may or may not be faster than the code above.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 3 (Potentially faster): MOV ECX, DWORD PTR[X+4] ;get upper 32 bits of double XOR EDX, EDX ;i = 0 MOV EAX, ECX ;save sign bit AND ECX, 07FF00000h ;isolate exponent field CMP ECX, 03FF00000h ;if abs(x) < 1.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Floating-Point Subexpression Elimination There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries. In the cases where two instructions share a source operand, an FXCH is not required between the two instructions.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 If an “argument out of range” is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an argument > 2^63 is unusual, it often indicates a problem elsewhere in the code and the code may completely fail in the absence of a properly guarded trigonometric instruction.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it. Take Advantage of the FSINCOS Instruction Frequently, a piece of code that needs to compute the sine of an argument also needs to compute the cosine of that same argument.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 10 3DNow!™ and MMX™ Optimizations This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be found in the 3DNow!™ Instruction Porting Guide, order# 22621.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction. 3DNow! and MMX instructions are designed to be used concurrently with no switching issues. Likewise, enhanced 3DNow! instructions can be used simultaneously with MMX instructions.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Pipelined Pair of 24-Bit Precision Divides This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within preceding code.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root 3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root. Optimized 15-Bit Precision Square Root This square root operation can be executed in only 7 cycles, assuming a program hides the latency of the first MOVD instruction within previous code.
Newton-Raphson Reciprocal Square Root
The general Newton-Raphson reciprocal square root recurrence is:
Z_{i+1} = 1/2 • Z_i • (3 – b • Z_i^2)
To reduce the number of iterations, the initial approximation is read from a table. The 3DNow! reciprocal square root approximation is accurate to at least 15 bits.
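The recurrence can be tried out in scalar C to see how quickly it converges. The sketch below is an illustration under stated assumptions: the function names are hypothetical, double precision is used for clarity, and the starting guess is passed in by the caller rather than read from a table.

```c
/* One Newton-Raphson step toward 1/sqrt(b):
   z_next = 1/2 * z * (3 - b * z * z) */
double rsqrt_step(double b, double z)
{
    return 0.5 * z * (3.0 - b * z * z);
}

/* Apply the step a fixed number of times, starting from z0. */
double rsqrt_refine(double b, double z0, int steps)
{
    double z = z0;
    int i;
    for (i = 0; i < steps; i++)
        z = rsqrt_step(b, z);
    return z;
}
```

Each step roughly doubles the number of correct bits, which is why a good initial approximation keeps the iteration count low.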
Example:
PXOR MM2, MM2 ; 0 | 0
MOVD MM0, [ab] ; 0 0 | b a
MOVD MM1, [cd] ; 0 0 | d c
PUNPCKLWD MM0, MM2 ; 0 b | 0 a
PUNPCKLWD MM1, MM2 ; 0 d | 0 c
PMADDWD MM0, MM1 ; b*d | a*c
3DNow!™ and MMX™ Intra-Operand Swapping
AMD Athlon™ Specific Code
If the swapping of MMX register halves is necessary, use the PSWAPD instruction, which is a new AMD Athlon 3DNow! DSP extension.
Fast Conversion of Signed Words to Floating-Point
In many applications there is a need to quickly convert data consisting of packed 16-bit signed integers into floating-point numbers. The following two examples show how this can be accomplished efficiently on AMD processors. The first example shows how to do the conversion on a processor that supports AMD’s 3DNow! extensions, such as the AMD Athlon processor.
cycle bypassing penalty, and another one cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four; therefore, in the worst case, the PXOR and PFMUL instructions are the same in terms of latency. On the AMD-K6 processor, there is only a one cycle latency for PXOR, versus a two cycle latency for the 3DNow! PFMUL instruction.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use MMX™ Instructions for Block Copies and Block Fills For moving or filling small blocks of data (e.g., less than 512 bytes) between cacheable memory areas, the REP MOVS and REP STOS families of instructions deliver good performance and are straightforward to use.
AMD Athlon™ Processor x86 Code Optimization $xfer: movq add movq add movq movq movq movq movq movq movq movq movq movq movq movq movq dec movq jnz femms } 22007E/0—November 1999 mm0, [eax] edx, 64 mm1, [eax+8] eax, 64 mm2, [eax-48] [edx-64], mm0 mm0, [eax-40] [edx-56], mm1 mm1, [eax-32] [edx-48], mm2 mm2, [eax-24] [edx-40], mm0 mm0, [eax-16] [edx-32], mm1 mm1, [eax-8] [edx-24], mm2 [edx-16], mm0 ecx [edx-8], mm1 $xfer /* block fill (destination QWORD aligned) */ __asm { mov mov shr movq edx, ecx, ecx,
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 AMD Athlon™ Processor Specific Code The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a quadword aligned block of data in the following situations: ■ ■ AMD Athlon processor specific code where the destination of the block copy is in non-cacheable memory space AMD Athlon processor specific code where the destination of the block copy is in cacheable space, but no immediate
/* block fill (destination QWORD aligned) */
__asm {
mov edx, [dst_ptr]
mov ecx, [blk_size]
shr ecx, 6
movq mm0, [fill_data]
align 16
$fill_nc:
movntq [edx], mm0
movntq [edx+8], mm0
movntq [edx+16], mm0
movntq [edx+24], mm0
movntq [edx+32], mm0
movntq [edx+40], mm0
movntq [edx+48], mm0
movntq [edx+56], mm0
add edx, 64
dec ecx
jnz $fill_nc
femms
sfence
}
Use MMX™ PXOR to Clear All Bits in an MMX™ Register
To clear all the bits in an MM
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register To set all the bits in an MMX register to one, use: PCMPEQD MMreg, MMreg Note that PCMPEQD MMreg, MMreg is dependent on previous writes to MMreg. Therefore, using PCMPEQD in the manner described can lengthen dependency chains, which in return may lead to reduced performance. An alternative in such cases is to use: ones DQ 0FFFFFFFFFFFFFFFFh MOVQ MMreg, QWORD PTR [ones] i.e.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 /* Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by "v" and stores the transformed vertices in the location pointed to by "res". Each vertex consists of four floats. The 4x4 transform matrix is pointed to by "m". The matrix elements are also floats. The argument "numverts" indicates how many vertices have to be transformed.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 $$xform: ADD MOVQ MOVQ ADD MOVQ MOVQ PUNPCKLDQ MOVQ PFMUL PUNPCKHDQ PFMUL MOVQ MOVQ MOVQ PFMUL MOVQ PUNPCKLDQ PFMUL MOVQ PFMUL PFADD EBX, MM0, MM1, EDX, MM2, MM3, MM0, MM4, MM3, MM2, MM4, MM5, MM7, MM6, MM5, MM0, MM1, MM7, MM2, MM0, MM3, 16 QWORD QWORD 16 MM0 QWORD MM0 QWORD MM0 MM2 MM2 QWORD QWORD MM1 MM0 QWORD MM1 MM2 QWORD MM1 MM4 MOVQ PFMUL PFADD MM4, QWORD PTR MM2, MM1 MM5, MM7 PTR PTR PTR PTR PTR PTR MOVQ MM1, QWORD PTR PUNP
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances, this activity is split into two parts which do not necessarily have to occur consecutively: ■ ■ Computation of the clip code for each vertex, where each bit of the clip code indicates whether the vertex is outside the frustum with regard to a specific clip plane.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 ;; ;; PXOR MOVQ MOVQ PUNPCKHDQ MOVQ MOVQ PFSUBR PFSUBR PUNPCKLDQ PFCMPGT MOVQ PFCMPGT PFCMPGT MOVQ PAND MOVQ PAND PAND POR POR MOVQ PUNPCKHDQ POR DESTROYS MM0, MM1, MM4, MM1, MM3, MM2, MM3, MM2, MM3, MM4, MM0, MM3, MM2, MM1, MM4, MM0, MM3, MM2, MM2, MM2, MM1, MM2, MM2, MM0,MM1,MM2,MM3,MM4 MM0 ; 0 | 0 MM6 ; w | z MM5 ; y | x MM1 ; w | w MM6 ; w | z MM5 ; y | x MM0 ; -w | -z MM0 ; -y | -x MM6 ; z | -z MM1 ; y>w?FFFFFFFF:0 | x>w?FFFFFFFF:0
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Example 1 (Avoid): 124 MOV MOV MOV MOV MOVQ MOVQ MOV ESI, EDI, EDX, EBX, MM7, MM6, ECX, DWORD DWORD DWORD DWORD QWORD QWORD 16 PTR PTR PTR PTR PTR PTR L1: MOVQ MOVQ MOVQ MOVQ PAND PAND PAND PAND POR PSRLQ PSRLQ PAND PADDB MM0, MM1, MM2, MM3, MM2, MM3, MM0, MM1, MM2, MM0, MM1, MM2, MM0, [ESI] [EDI] MM0 MM1 MM6 MM6 MM7 MM7 MM3 1 1 MM6 MM1 PADDB MOVQ MOVQ MOVQ MOVQ MOVQ PAND PAND PAND PAND POR PSRLQ PSRLQ PAND PADDB MM0, MM2 [EDI],
The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock:
Example 2 (Preferred): MOV MOV MOV MOV MOV EAX, EDI, EDX, EBX, ECX, DWORD DWORD DWORD DWORD 16 PTR PTR PTR PTR L1: MOVQ MOVQ PAVGUSB MM0, [EAX] MM1, [EAX+8] MM0, [EDI] PAVGUSB MM1, [EDI+8] ADD MOVQ MOVQ ADD LOOP EAX, EDX [EDI], MM0 [EDI+8], MM1 EDI, EBX L1 Src_MB Dst_MB SrcStride DstStr
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Complex Number Arithmetic Complex numbers have a “real” part and an “imaginary” part. Multiplying complex numbers (ex. 3 + 4i) is an integral part of many algorithms such as Discrete Fourier Transform (DFT) and complex FIR filters. Complex number multiplication is shown below: (src0.real + src0.imag) * (src1.real + src1.imag) = result result = (result.real + result.imag) result.real <= src0.real*src1.real - src0.imag*src1.imag result.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 11 General x86 Optimization Guidelines This chapter describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6 ® processor, AMD Athlon™ processor, and Pentium ® family processors).
Dependencies
Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
Register Operands
Maintain frequently used values in registers rather than in memory. This technique avoids the comparatively long latencies for accessing memory.
Appendix A
AMD Athlon™ Processor Microarchitecture
Introduction
When discussing processor design, it is important to understand the following terms—architecture, microarchitecture, and design implementation. The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor. The architecture determines what software the processor can run.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 AMD Athlon™ Processor Microarchitecture The innovative AMD Athlon processor microarchitecture approach implements the x86 instruction set by processing simpler operations (OPs) instead of complex x86 instructions. These OPs are specially designed to include direct support for the x86 instructions while observing the high-performance principles of fixed-length encoding, regularized instruction fields, and a large register set.
[Figure: processor block diagram. Labels recovered from the image: 2-Way, 64-Kbyte Instruction Cache; 24-Entry L1 TLB/256-Entry L2 TLB; Fetch/Decode Control; Predecode Cache; Branch Prediction Table; 3-Way x86 Instruction Decoders; Instruction Control Unit (72-Entry); FPU Stack Map / Rename; Integer Scheduler (18-Entry); FPU Scheduler (36-Entry); FPU Register File (88-Entry); Bus Interface Unit; IEU0; AGU0; IEU1; AGU1; IEU2; AGU2; FADD MMX™ 3DNow!™; FMUL MMX 3DNow!; FSTORE; L2 Cache Controller; Load / Store Queu]
replacement is based on a least-recently used (LRU) replacement algorithm. The L1 instruction cache has an associated two-level translation look-aside buffer (TLB) structure. The first-level TLB is fully associative and contains 24 entries (16 that map 4-Kbyte pages and eight that map 2-Mbyte or 4-Mbyte pages).
return stack. Subsequent RETs pop a predicted return address off the top of the stack.
Early Decoding
The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs. A MacroOP is a fixed length instruction which contains one or more OPs. The outputs of the early decoders keep all (DirectPath or VectorPath) instructions in program order.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Instruction Control Unit The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources—the centralized in-flight reorder buffer, the integer scheduler, and the floating-point scheduler.
Integer Scheduler
The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes. The reservation stations are six entries deep, for a total queuing system of 18 integer MacroOPs. Each reservation station divides the MacroOPs into integer and address generation OPs, as required.
Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calculate the logical addresses for loads, stores, and LEAs. A load and store unit reads and writes data to and from the L1 data cache.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Floating-Point Execution Unit The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructions, all 3DNow! operations, and all MMX operations. The FPU consists of a stack renaming unit, a register renaming unit, a scheduler, a register file, and three parallel execution units.
AMD Athlon™ Processor x86 Code Optimization 22007E/0—November 1999 Load-Store Unit (LSU) The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory. The 44-entry LSU provides a data interface for both the integer scheduler and the floating-point scheduler. It consists of two queues—a 12-entry queue for L1 cache load and store accesses and a 32-entry queue for L2 cache or system memory load and store accesses.
L2 Cache Controller

The AMD Athlon processor contains a very flexible onboard L2 controller. It uses an independent backside bus to access up to 8 Mbytes of industry-standard SRAMs. There are full on-chip tags for a 512-Kbyte cache, while larger sizes use a partial tag system. In addition, there is a two-level data TLB structure.
Appendix B

Pipeline and Execution Unit Resources Overview

The AMD Athlon™ processor contains two independent execution pipelines: one for integer operations and one for floating-point operations. The integer pipeline manages x86 integer operations and the floating-point pipeline manages all x87, 3DNow!™ and MMX™ instructions. This appendix describes the operation and functionality of these pipelines.
[Figure 5: fetch and decode pipeline diagram. Blocks: Entry Point Decode, VectorPath Decode, MROM Decode, I-CACHE (16 bytes), Quadword Queue, DirectPath Decode, 3 MacroOPs. Stages by cycle: 1 FETCH, 2 SCAN, 3 ALIGN1/MECTL, 4 ALIGN2/MEROM, 5 EDEC/MEDEC, 6 IDEC.]
Cycle 1–FETCH

The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory.

Cycle 2–SCAN

SCAN determines the start and end pointers of instructions. SCAN can send up to six aligned instructions (DirectPath and VectorPath) to ALIGN1 and only one VectorPath instruction to the microcode engine (MENG) per cycle.
operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU.

Integer Pipeline Stages

The integer execution pipeline consists of four or more stages for scheduling and execution and, if necessary, accessing data in the processor caches or system memory. There are three integer pipes associated with the three IEUs.
Cycle 7–SCHED

In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus. When all operands are received, SCHED schedules the MacroOP for execution and issues the OPs to the next stage, EXEC.

Cycle 8–EXEC

In the execution (EXEC) pipeline stage, the OP and its associated operands are processed by an integer pipe (either the IEU or the AGU).
Floating-Point Pipeline Stages

The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructions, all 3DNow! operations, and all MMX operations. The FPU consists of a stack renaming unit, a register renaming unit, a scheduler, a register file, and three parallel execution units.
Cycle 7–STKREN

The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags.

Cycle 8–REGREN

The register renaming (REGREN) pipeline stage in cycle 8 is responsible for register renaming. In this stage, virtual register tags are mapped into physical register tags. Likewise, each destination is assigned a new physical register.
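The stack-relative mapping done in STKREN can be pictured with the architectural x87 rule that ST(i) names the register at (TOP + i) modulo 8. The following is only a sketch of that arithmetic; the function name is hypothetical, and it does not claim to describe how the hardware implements the stage.

```c
/* Sketch: resolve a stack-relative x87 operand ST(i) to a register number,
 * assuming the architectural rule register = (TOP + i) mod 8.
 * st_to_register is a hypothetical name used for illustration only. */
unsigned st_to_register(unsigned top, unsigned i)
{
    return (top + i) & 7u;   /* eight x87 registers, wraps around */
}
```

With TOP equal to 6, ST(3) wraps around to register 1.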
Execution Unit Resources

Terminology

The execution units operate with two types of register values: operands and results. There are three operand types and two result types, which are described in this section.
Integer Pipeline Operations

Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type.

Table 2. Integer Pipeline Operation Types

Category                                   Execution Unit
Integer Memory Load or Store Operations    L/S
Address Generation Operations              AGU
Integer Execution Unit Operations          IEU
Integer Multiply Operations

Table 3.
Floating-Point Pipeline Operations

Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types.

Table 4. Floating-Point Pipeline Operation Types

Category                                                  Execution Unit
FPU/3DNow!/MMX Load/store or Miscellaneous Operations     FSTORE
FPU/3DNow!/MMX Multiply Operation                         FMUL
FPU/3DNow!/MMX Arithmetic Operation                       FADD

Table 5.
Load/Store Pipeline Operations

The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations.
Code Sample Analysis

The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints. The sample tables show the x86 instructions, the decode pipe in the integer execution pipeline, the decode type, the clock counts, and a description of the events occurring within the processor.
Table 7.

Table 8.
Appendix C

Implementation of Write Combining

Introduction

This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family.
Write-Combining Definitions and Abbreviations

This appendix uses the following definitions and abbreviations:

■ UC: Uncacheable memory type
■ WC: Write-combining memory type
■ WT: Writethrough memory type
■ WP: Write-protected memory type
■ WB: Writeback memory type
■ One Byte: 8 bits
■ One Word: 16 bits
■ Longword: 32 bits (same as an x86 doubleword)
■ Quadword: 64 bits or 2 longwords
■ Octaword: 128 bits or 2 quadwords
■ Cache Block: 64 bytes or 4
signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six.

2. In addition, the presence of the MTRRs is indicated by bit 12 and the presence of the PAT extension is indicated by bit 16 of the extended features bits returned in the EDX register by CPUID function 8000_0001h.
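The bit tests described above can be sketched in C as follows. This is a minimal illustration that assumes the caller has already executed CPUID and holds the returned EAX signature and the EDX value from function 8000_0001h; the function names are hypothetical and not part of any AMD-provided interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Instruction family code: bits 11-8 of the signature returned in EAX. */
unsigned cpu_family(uint32_t signature_eax)
{
    return (signature_eax >> 8) & 0xFu;
}

/* MTRR presence: bit 12 of EDX returned by CPUID function 8000_0001h. */
bool has_mtrr(uint32_t ext_features_edx)
{
    return ((ext_features_edx >> 12) & 1u) != 0;
}

/* PAT extension presence: bit 16 of the same EDX value. */
bool has_pat(uint32_t ext_features_edx)
{
    return ((ext_features_edx >> 16) & 1u) != 0;
}

/* The text gives family code six for the AMD Athlon processor. */
bool is_family_six(uint32_t signature_eax)
{
    return cpu_family(signature_eax) == 6u;
}
```

Keeping the decoding separate from the CPUID call itself makes the checks easy to exercise with recorded register values.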
Table 9. Write Combining Completion Events

Event: Non-WB write outside of current buffer
Comment: The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect write combining. Only one line-sized buffer can be open for write combining at a time. Once a buffer is closed for write combining, it cannot be reopened for write combining.
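The rule for a non-WB write outside the current buffer can be modeled with a few lines of C. This sketch covers only that one completion event, assumes the 64-byte cache block size given in the definitions, and uses hypothetical names (wc_buffer, wc_write) that do not correspond to any hardware interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define WC_BLOCK_SIZE 64u   /* cache block size in bytes */

/* Toy model of the single open write-combining buffer. */
typedef struct {
    bool     open;    /* a buffer is currently open for combining */
    uint64_t block;   /* cache block number of the open buffer */
} wc_buffer;

/* Record a non-WB write to addr. Returns true when the write falls in a
 * different cache block and therefore closes combining for earlier writes. */
bool wc_write(wc_buffer *buf, uint64_t addr)
{
    uint64_t block = addr / WC_BLOCK_SIZE;
    bool closed_previous = buf->open && buf->block != block;

    buf->open = true;     /* the new write opens (or stays in) a buffer */
    buf->block = block;
    return closed_previous;
}
```

Writes to 1000h and 1038h share one block, so the second write does not close anything; a following write to 1040h does.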
Sending Write-Buffer Data to the System

Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists the rules for determining what system commands are issued for a write buffer, as a function of the alignment of the valid buffer data.

Table 10. AMD Athlon™ System Bus Commands Generation Rules

1.
Appendix D

Performance-Monitoring Counters

This chapter describes how to use the AMD Athlon™ processor performance monitoring counters.

Overview

The AMD Athlon processor provides four 48-bit performance counters, which allow four types of events to be monitored simultaneously. These counters can either count events or measure duration.
These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h. The PerfCtr[3:0] registers are located at MSR locations C001_0004h to C001_0007h and are 64-bit registers. The PerfEvtSel[3:0] registers can be accessed using the RDMSR/WRMSR instructions only when operating at privilege level 0.
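The MSR numbering can be captured in two small helpers. This is a sketch that assumes the four event-select registers start at C001_0000h and the four counters at C001_0004h, each set occupying consecutive MSR addresses; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define PERF_EVT_SEL_BASE  0xC0010000u  /* PerfEvtSel0 */
#define PERF_CTR_BASE      0xC0010004u  /* PerfCtr0 */
#define PERF_COUNTER_COUNT 4u

/* MSR address of PerfEvtSel[n], or 0 if n is out of range. */
uint32_t perf_evt_sel_msr(unsigned n)
{
    return n < PERF_COUNTER_COUNT ? PERF_EVT_SEL_BASE + n : 0u;
}

/* MSR address of PerfCtr[n], or 0 if n is out of range. */
uint32_t perf_ctr_msr(unsigned n)
{
    return n < PERF_COUNTER_COUNT ? PERF_CTR_BASE + n : 0u;
}
```

The returned address is what privileged code would pass in ECX to RDMSR or WRMSR.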
Unit Mask Field (Bits 8–15)

These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states. See Table 11 on page 164 for a list of unit masks and their 8-bit codes.

USR (User Mode) Flag (Bit 16)

Events are counted only when the processor is operating at privilege levels 1, 2 or 3.
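A PerfEvtSel value can be assembled from these fields as sketched below. The unit mask position (bits 8 to 15) and the USR flag (bit 16) come from the text above; placing the event select code in bits 0 to 7 is an assumption of this sketch, and the function names are hypothetical.

```c
#include <stdint.h>

#define PERF_EVTSEL_UNIT_MASK_SHIFT 8
#define PERF_EVTSEL_USR             (1u << 16)  /* count at privilege levels 1-3 */

/* Build the low bits of a PerfEvtSel value.
 * Assumption: the event select code occupies bits 7-0. */
uint32_t perf_evt_sel_encode(uint8_t event, uint8_t unit_mask, int count_user)
{
    uint32_t value = event;
    value |= (uint32_t)unit_mask << PERF_EVTSEL_UNIT_MASK_SHIFT;
    if (count_user) {
        value |= PERF_EVTSEL_USR;
    }
    return value;
}

/* Extract the unit mask (bits 15-8) from a PerfEvtSel value. */
uint8_t perf_evt_sel_unit_mask(uint32_t value)
{
    return (uint8_t)((value >> PERF_EVTSEL_UNIT_MASK_SHIFT) & 0xFFu);
}
```

The event and unit mask codes themselves would be taken from Table 11.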
greater than or equal to the counter mask. Otherwise, if this field is zero, the counter increments by the total number of events.

Table 11.
allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are useful for generating an interrupt after a specific number of events.
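One way to derive such a negative starting value is sketched below. It assumes the interrupt is raised when the 48-bit counter wraps past zero, so a counter preloaded with -N wraps after N further events; the helper names are hypothetical.

```c
#include <stdint.h>

#define PERF_CTR_BITS 48
#define PERF_CTR_MASK ((UINT64_C(1) << PERF_CTR_BITS) - 1)

/* Signed value to write so that the counter reaches zero after `events`
 * more events. Returns 0 if `events` is 0 or exceeds 2^47. */
int64_t perf_ctr_preload(uint64_t events)
{
    if (events == 0 || events > (UINT64_C(1) << 47)) {
        return 0;
    }
    return -(int64_t)events;
}

/* Low 48 bits of a signed value, as a 48-bit counter would hold it. */
uint64_t perf_ctr_image(int64_t value)
{
    return (uint64_t)value & PERF_CTR_MASK;
}
```

For example, a preload of -1000 corresponds to the 48-bit pattern FFFF_FFFF_FC18h, which is 1000 counts short of wrapping.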
The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them and initializes the counter MSRs (PerfCtr[3:0]) to starting counts. The stop counters procedure stops the performance counters. (See “Starting and Stopping the Performance-Monitoring Counters” on page 168 for more information about starting and stopping the counters.
An event monitor application utility or another application program can read the collected performance information of the profiled application.
Appendix E

Programming the MTRR and PAT

Introduction

The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includes the Page Attribute Table (PAT) for defining attributes of pages. This chapter documents the use and capabilities of this feature.
There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4K, 16K or 64K segment within the first 1 Mbyte of memory, there is one fixed address MTRR. The fixed address ranges all exist in the first 1 Mbyte. There are eight variable address ranges above 1 Mbyte. Each is programmed to a specific memory starting address, size and alignment.
[Figure 12: memory map of the MTRR address ranges. 0 to 80000h: 8 fixed ranges of 64 Kbytes each (512 Kbytes). 80000h to C0000h: 16 fixed ranges of 16 Kbytes each (256 Kbytes). C0000h to 100000h: 64 fixed ranges of 4 Kbytes each (256 Kbytes). 100000h to FFFFFFFFh: 0-8 variable ranges (2^12 to 2^32) and SMM TSeg.]
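The fixed-range layout in the first 1 Mbyte can be turned into a simple lookup. The sketch below assumes the layout shown in Figure 12 (8 ranges of 64 Kbytes, then 16 of 16 Kbytes, then 64 of 4 Kbytes); the running index from 0 to 87 is a numbering invented for this example, not a hardware register index.

```c
#include <stdint.h>

/* Returns an illustrative index (0..87) of the fixed range that covers
 * addr, or -1 if addr is at or above 1 Mbyte (variable-range territory). */
int mtrr_fixed_range_index(uint32_t addr)
{
    if (addr < 0x80000u) {                       /* 8 ranges x 64 Kbytes */
        return (int)(addr >> 16);
    }
    if (addr < 0xC0000u) {                       /* 16 ranges x 16 Kbytes */
        return 8 + (int)((addr - 0x80000u) >> 14);
    }
    if (addr < 0x100000u) {                      /* 64 ranges x 4 Kbytes */
        return 24 + (int)((addr - 0xC0000u) >> 12);
    }
    return -1;
}
```

The three groups together give 8 + 16 + 64 = 88 fixed ranges covering exactly 1 Mbyte.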
Memory Types

Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in Table 12 on page 174.

Table 12. Memory Type Encodings

Type Number   Type Name
00h           UC: Uncacheable
01h           WC: Write-Combining. Uncacheable for reads or writes. Can be combined. Can be speculative for reads. Writes can never be speculative.
MTRR Default Type Register Format. The MTRR default type register is defined as follows.

Figure 14. MTRR Default Type Register Format

Symbol   Description            Bits
E        MTRRs Enabled          11
FE       Fixed Range Enabled    10
Type     Default Memory Type    7–0

The remaining bits of the 64-bit register are shown as reserved.

E    MTRRs are enabled when set.
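Decoding the default type register follows directly from the bit positions listed for Figure 14 (E in bit 11, FE in bit 10, Type in bits 7 to 0). The struct and function names in this sketch are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the MTRR default type register. */
typedef struct {
    bool    mtrr_enabled;    /* E, bit 11 */
    bool    fixed_enabled;   /* FE, bit 10 */
    uint8_t default_type;    /* Type, bits 7-0 */
} mtrr_def_type;

mtrr_def_type mtrr_def_type_decode(uint64_t msr_value)
{
    mtrr_def_type d;
    d.mtrr_enabled  = ((msr_value >> 11) & 1u) != 0;
    d.fixed_enabled = ((msr_value >> 10) & 1u) != 0;
    d.default_type  = (uint8_t)(msr_value & 0xFFu);
    return d;
}
```

The input is the raw 64-bit value as read with RDMSR; reserved bits are ignored by the decoder.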
Table 13.
not affected by this issue; only the variable range (and MTRR DefType) registers are affected.

Page Attribute Table (PAT)

The Page Attribute Table (PAT) is an extension of the page table entry format, which allows the specification of memory types to regions of physical memory based on the linear address. The PAT provides the same functionality as MTRRs with the flexibility of the page tables.
Accessing the PAT

A 3-bit index, consisting of the PATi, PCD, and PWT bits of the page table entry, is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs which map to 2-Mbyte or 4-Mbyte pages). The memory type from the PAT is used instead of the PCD and PWT for the effective memory type.
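The index computation can be sketched as follows. The PATi positions (bit 7 for 4-Kbyte PTEs, bit 12 for large-page PDEs) come from the text; the PWT and PCD positions (bits 3 and 4) are taken from the standard x86 page table entry format rather than from this appendix, and the function name is hypothetical.

```c
#include <stdint.h>

#define PTE_PWT        (1u << 3)    /* page-level write-through */
#define PTE_PCD        (1u << 4)    /* page-level cache disable */
#define PTE_PAT_4K     (1u << 7)    /* PATi in a 4-Kbyte PTE */
#define PDE_PAT_LARGE  (1u << 12)   /* PATi in a 2-Mbyte or 4-Mbyte PDE */

/* Compute the 3-bit PAT index {PATi, PCD, PWT} from the low 32 bits of a
 * page table entry. is_large_page selects the PDE position of PATi. */
unsigned pat_index(uint32_t entry, int is_large_page)
{
    uint32_t pat_bit = is_large_page ? PDE_PAT_LARGE : PTE_PAT_4K;
    unsigned pati = (entry & pat_bit) ? 1u : 0u;
    unsigned pcd  = (entry & PTE_PCD) ? 1u : 0u;
    unsigned pwt  = (entry & PTE_PWT) ? 1u : 0u;

    return (pati << 2) | (pcd << 1) | pwt;
}
```

Note that bit 7 of a large-page PDE is not PATi, so the same raw entry can yield different indexes depending on the page size.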
Table 15. Effective Memory Type Based on PAT and MTRRs

PAT Memory Type   MTRR Memory Type   Effective Memory Type
UC-               WB, WT, WP, WC     UC-Page
                  UC                 UC-MTRR
WC                x                  WC
WT                WB, WT             WT
                  UC                 UC
                  WC                 CD
                  WP                 CD
WP                WB, WP             WP
                  UC                 UC-MTRR
                  WC, WT             CD
WB                WB                 WB
                  UC                 UC
                  WC                 WC
                  WT                 WT
                  WP                 WP

Notes:
1. UC-MTRR indicates that the UC attribute came from the MTRRs and that the processor caches should not be probed for performance reasons.
2.
Table 16. Final Output Memory Types (Continued)

Input Memory Type                              Output Memory Type          AMD-751 Note
RdMem  WrMem  Effective MType  forceCD (5)     RdMem  WrMem  MemType
●      ●      CD               -               ●      ●      CD
●      ●      WC               -               ●      ●      WC
●      ●      WT               -               ●      ●      WT
●      ●      WP               -               ●      ●      WP
●      ●      WB               -               ●      ●      WT          4
●      ●      -                ●               ●      ●      CD          2

Notes:
1. WP is not functional for RdMem/WrMem.
2. ForceCD must cause the MTRR memory type to be ignored in order to avoid x’s.
3.
MTRR Fixed-Range Register Format

The memory types for the memory segments covered by each of the MTRR fixed-range registers are defined in Table 17 (also see “Standard MTRR Types and Properties” on page 176).

Table 17.
Variable-Range MTRRs

A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap. The upper two variable MTRRs should not be used by the BIOS and are reserved for operating system use.

Variable-Range MTRR Register Format

The variable address range is power-of-2 sized and aligned.
Figure 17. MTRRphysMaskn Register Format

Symbol          Description                                               Bits
Physical Mask   24-Bit Mask                                               35–12
V               Variable Range Register Pair Enabled (V = 0 at reset)     11

The remaining bits are reserved.

Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Mask    Specifies a 24-bit mask to determine the range of the region defined in the register pair.
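For a power-of-2 sized and aligned region, the mask value can be derived from the region size. The sketch below uses the Figure 17 layout (mask in bits 35 to 12, V in bit 11) and assumes the conventional x86 matching rule, in which an address belongs to the range when (address AND mask) equals (base AND mask). The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MTRR_PHYS_ADDR_BITS  UINT64_C(0xFFFFFF000)   /* bits 35-12 */
#define MTRR_PHYS_MASK_VALID (UINT64_C(1) << 11)     /* V bit */

/* MTRRphysMask value for a region of `size` bytes with V set.
 * Returns 0 if size is not a power of 2 between 4 Kbytes and 2^36. */
uint64_t mtrr_phys_mask(uint64_t size)
{
    if (size < 4096 || size > (UINT64_C(1) << 36) || (size & (size - 1)) != 0) {
        return 0;
    }
    return (~(size - 1) & MTRR_PHYS_ADDR_BITS) | MTRR_PHYS_MASK_VALID;
}

/* Assumed matching rule: addr is in the range if the masked address
 * equals the masked base and the register pair is enabled. */
bool mtrr_range_matches(uint64_t base, uint64_t mask_reg, uint64_t addr)
{
    uint64_t mask = mask_reg & MTRR_PHYS_ADDR_BITS;
    return (mask_reg & MTRR_PHYS_MASK_VALID) != 0 &&
           (addr & mask) == (base & mask);
}
```

A 1-Mbyte region, for example, yields the mask value F_FFF0_0800h, including the V bit.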
MTRR MSR Format

This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits.

Table 18. MTRR-Related Model-Specific Register (MSR) Map

Register Address   Register Name
0FEh               MTRRcap          See “MTRR Capability Register Format” on page 174.
200h               MTRR Base0       See “MTRRphysBasen Register Format” on page 183.
Appendix F

Instruction Dispatch and Execution Resources

This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations. Tables 19 through 24 starting on page 188 define the integer, MMX™, MMX extensions, floating-point, 3DNow!™, and 3DNow! extensions instructions, respectively.
■ disp16/32: 16-bit or 32-bit displacement value
■ disp32/48: 32-bit or 48-bit displacement value
■ eXX: register width depending on the operand size
■ mem32real: 32-bit floating-point value in memory
■ mem64real: 64-bit floating-point value in memory
■ mem80real: 80-bit floating-point value in memory
■ mmreg: MMX/3DNow! register
■ mmreg1: MMX/3DNow! register defined by bits 5, 4, and 3 of the modR/M byte
■ mmreg2: MMX/3DNow! register defined by
Table 19.

Table 20.

Table 21.

Table 22.

Table 23.
Appendix G

DirectPath versus VectorPath Instructions

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions.
Table 25. DirectPath Integer Instructions
Table 26. DirectPath MMX™ Instructions (Continued)

Instruction Mnemonic
PSRLD mmreg, imm8
PSRLQ mmreg1, mmreg2
PSRLQ mmreg, mem64
PSRLQ mmreg, imm8
PSRLW mmreg1, mmreg2
PXOR mmreg, mem64

Table 27.
Table 28. DirectPath Floating-Point Instructions
VectorPath Instructions

The following tables contain VectorPath instructions, which should be avoided on the AMD Athlon processor:

■ Table 29, “VectorPath Integer Instructions,” on page 231
■ Table 30, “VectorPath MMX™ Instructions,” on page 234 and Table 31, “VectorPath MMX™ Extensions,” on page 234
■ Table 32, “VectorPath Floating-Point Instructions,” on page 235

Table 29. VectorPath Integer Instructions
Table 29. VectorPath Integer Instructions (Continued)

Instruction Mnemonic
STI
STOSB mem8, AL
STOSW mem16, AX
STOSD mem32, EAX

Table 30. VectorPath MMX™ Instructions

Instruction Mnemonic
MOVD mmreg, mreg32
MOVD mreg32, mmreg

Table 31.
Table 32. VectorPath Floating-Point Instructions