Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of five volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-M, Order Number 253666; Instruction Set Reference N-Z, Order Number 253667; System Programming Guide, Part 1, Order Number 253668; System Programming Guide, Part 2, Order Number 253669. Refer to all five volumes when evaluating your design needs.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
CONTENTS
CHAPTER 1 ABOUT THIS MANUAL
1.1 INTEL® 64 AND IA-32 PROCESSORS COVERED IN THIS MANUAL
1.2 OVERVIEW OF VOLUME 1: BASIC ARCHITECTURE
1.3 NOTATIONAL CONVENTIONS
1.3.1 Bit and Byte Order
2.2.8 Intel Virtualization Technology (Intel VT)
2.3 INTEL® 64 AND IA-32 PROCESSOR GENERATIONS
CHAPTER 3 BASIC EXECUTION ENVIRONMENT
3.1 MODES OF OPERATION
3.1.1 Intel® 64 Architecture
CHAPTER 4 DATA TYPES
4.1 FUNDAMENTAL DATA TYPES
4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords
4.2 NUMERIC DATA TYPES
4.2.1 Integers
5.1.1 Data Transfer Instructions
5.1.2 Binary Arithmetic Instructions
5.6.1.6 SSE2 Conversion Instructions
5.6.2 SSE2 Packed Single-Precision Floating-Point Instructions
5.6.3 SSE2 128-Bit SIMD Integer Instructions
6.4.2 Calls to Interrupt or Exception Handler Tasks
6.4.3 Interrupt and Exception Handling in Real-Address Mode
6.4.4 INT n, INTO, INT 3, and BOUND Instructions
6.4.5 Handling Floating-Point Exceptions
7.3.8.5 Software Interrupt Instructions in 64-bit Mode and Compatibility Mode
7.3.9 String Operations
7.3.9.1 Repeating String Operations
8.2 X87 FPU DATA TYPES
8.2.1 Indefinites
9.4.1 Data Transfer Instructions
9.4.2 Arithmetic Instructions
9.4.3 Comparison Instructions
10.4.6.3 PREFETCHh Instructions
10.4.6.4 SFENCE Instruction
10.5 FXSAVE AND FXRSTOR INSTRUCTIONS
10.6 HANDLING SSE INSTRUCTION EXCEPTIONS
11.6.1 General Guidelines for Using SSE/SSE2 Extensions
11.6.2 Checking for SSE/SSE2 Support
11.6.3 Checking for the DAZ Flag in the MXCSR Register
12.6.5 Packed Shuffle Bytes
12.6.6 Packed Sign
12.6.7 Packed Align Right
APPENDIX D GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS
D.1 MS-DOS COMPATIBILITY SUB-MODE FOR HANDLING X87 FPU EXCEPTIONS
D.2 IMPLEMENTATION OF THE MS-DOS COMPATIBILITY SUB-MODE IN THE INTEL486, PENTIUM, AND P6 PROCESSOR FAMILY, AND PENTIUM 4 PROCESSORS
D.2.1 MS-DOS Compatibility Sub-mode in the Intel486 and Pentium Processors
D.2.1.1 Basic Rules: When FERR# Is Generated
E.4.2.2 Results of Operations with NaN Operands or a NaN Result for SSE/SSE2/SSE3 Numeric Instructions
E.4.2.3 Condition Codes, Exception Flags, and Response for Masked and Unmasked Numeric Exceptions
E.4.3 Example SIMD Floating-Point Emulation Implementation
FIGURES
Figure 13-1. Memory-Mapped I/O
Figure 13-2. I/O Permission Bit Map
Figure D-1. Recommended Circuit for MS-DOS Compatibility x87 FPU Exception Handling
TABLES
Table 11-3. Effect of Prefixes on SSE, SSE2, and SSE3 Instructions
Table 13-1. I/O Instruction Serialization
CHAPTER 1
ABOUT THIS MANUAL
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number 253665) is part of a set that describes the architecture and programming environment of Intel® 64 and IA-32 architecture processors. Other volumes in this set are:
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B: Instruction Set Reference (order numbers 253666 and 253667).
• Dual-Core Intel® Xeon® processor LV
• Intel® Core™2 Duo processor
• Intel® Xeon® processor 5100 series
P6 family processors are IA-32 processors based on the P6 family microarchitecture. This includes the Pentium® Pro, Pentium® II, Pentium® III, and Pentium® III Xeon® processors.
The Pentium® 4, Pentium® D, and Pentium® processor Extreme Editions are based on the Intel NetBurst® microarchitecture. Most early Intel® Xeon® processors are based on the Intel NetBurst® microarchitecture.
Chapter 4: Data Types. Describes the data types and addressing modes recognized by the processor; provides an overview of real numbers and floating-point formats and of floating-point exceptions.
Chapter 5: Instruction Set Summary. Lists all Intel 64 and IA-32 instructions, divided into technology groups.
Chapter 6: Procedure Calls, Interrupts, and Exceptions. Describes the procedure stack and mechanisms provided for making procedure calls and for servicing interrupts and exceptions.
Appendix C: Floating-Point Exceptions Summary. Summarizes exceptions raised by the x87 FPU floating-point and SSE/SSE2/SSE3 floating-point instructions.
Appendix D: Guidelines for Writing x87 FPU Exception Handlers. Describes how to design and write MS-DOS* compatible exception handling facilities for FPU exceptions (includes software and hardware requirements and assembly-language code examples). This appendix also describes general techniques for writing robust FPU exception handlers.
1.3.2 Reserved Bits and Software Compatibility
In many register and memory layout descriptions, certain bits are marked as reserved. When bits are marked as reserved, it is essential for compatibility with future processors that software treat these bits as having a future, though unknown, effect. The behavior of reserved bits should be regarded as not only undefined, but unpredictable.
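One common way to honor this guidance is a read-modify-write update that changes only the bits software owns and writes every other bit back exactly as it was read. The sketch below illustrates the idea in Python. The function name `update_field` and the example mask are hypothetical and do not correspond to any particular register layout.

```python
def update_field(current: int, new_bits: int, writable_mask: int) -> int:
    """Return a register image in which only the writable bits are changed.

    current       -- value previously read from the register
    new_bits      -- desired values for the writable bits
    writable_mask -- 1 for each bit software may change (hypothetical layout)
    """
    preserved = current & ~writable_mask  # reserved bits keep the value that was read
    changed = new_bits & writable_mask    # attempts to touch reserved bits are dropped
    return preserved | changed


# Hypothetical 16-bit register: bits 0-3 are defined, bits 4-15 are reserved.
WRITABLE = 0x000F
old_value = 0xA5C3
new_value = update_field(old_value, 0xFFFF, WRITABLE)
print(hex(new_value))
```

The same mask can be used when testing a value: comparing `value & WRITABLE` instead of `value` keeps the result independent of whatever the reserved bits happen to contain.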
For example:
    LOADREG: MOV EAX, SUBTOTAL
In this example, LOADREG is a label, MOV is the mnemonic identifier of an opcode, EAX is the destination operand, and SUBTOTAL is the source operand. Some assembly languages put the source and destination in reverse order.
1.3.3 Hexadecimal and Binary Numbers
Base 16 (hexadecimal) numbers are represented by a string of hexadecimal digits followed by the character H (for example, 0F82EH).
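As a small illustration of the H-suffix convention, the following Python sketch converts such a literal to an integer. The helper name `parse_h_suffixed` is an assumption made for this example, and it handles only the hexadecimal form described above.

```python
def parse_h_suffixed(text: str) -> int:
    """Convert a manual-style hexadecimal literal such as '0F82EH' to an int."""
    literal = text.strip().upper()
    if len(literal) < 2 or not literal.endswith("H"):
        raise ValueError(f"expected hexadecimal digits followed by H: {text!r}")
    # Drop the trailing H and parse the remaining digits as base 16.
    return int(literal[:-1], 16)


print(parse_h_suffixed("0F82EH"))  # the manual's example value
```

The leading zero in 0F82EH is kept by the notation so that the literal starts with a decimal digit; it does not change the value.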
1.3.5 A New Syntax for CPUID, CR, and MSR Values
Obtain feature flags, status, and system information by using the CPUID instruction, by checking control register bits, and by reading model-specific registers. We are moving toward a new syntax to represent this information. See Figure 1-2.
1.3.6 Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other conditions. Some types of exceptions may provide error codes. An error code reports additional information about the error.
CHAPTER 2
INTEL® 64 AND IA-32 ARCHITECTURES
The exponential growth of computing power and ownership has made the computer one of the most important forces shaping business and society. Intel 64 and IA-32 architectures have been at the forefront of the computer revolution and are today the preferred computer architectures, as measured by computers in use and the total computing power available in the world.
2.
• Read-only and execute-only segment options
• Four privilege levels
2.1.3 The Intel386™ Processor (1985)
The Intel386 processor was the first 32-bit processor in the IA-32 architecture family. It introduced 32-bit registers for use both to hold operands and for addressing. The lower half of each 32-bit Intel386 register retains the properties of the 16-bit registers of earlier generations, permitting backward compatibility.
In addition, the processor added:
• Extensions to make the virtual-8086 mode more efficient and allow for 4-MByte as well as 4-KByte pages
• Internal data paths of 128 and 256 bits add speed to internal data transfers
• Burstable external data bus was increased to 64 bits
• An APIC to support systems with multiple processors
• A dual processor mode to support glueless two processor systems
A subsequent stepping of the Pentium family introduced Intel MMX technology (th
• The Intel Celeron processor family focused on the value PC market segment. Its introduction offers an integrated 128 KBytes of Level 2 cache and a plastic pin grid array (P.P.G.A.) form factor to lower system design cost.
• The Intel Pentium III processor introduced the Streaming SIMD Extensions (SSE) to the IA-32 architecture.
2.1.9 The Intel® Pentium® M Processor (2003-Current)
The Intel Pentium M processor family is a high performance, low power mobile processor family with microarchitectural enhancements over previous generations of IA-32 Intel mobile processors. This family is designed for extending battery life and seamless integration with platform innovations that enable new usage models (such as extended mobility, ultra thin form-factors, and integrated wireless networking).
2.1.11 The Intel® Core™ Duo and Intel® Core™ Solo Processors (2006-Current)
The Intel Core Duo processor offers power-efficient, dual-core performance with a low-power design that extends battery life. This family and the single-core Intel Core Solo processor offer microarchitectural enhancements over the Pentium M processor family.
2.2.1 P6 Family Microarchitecture
The Pentium Pro processor introduced a new microarchitecture commonly referred to as P6 processor microarchitecture. The P6 processor microarchitecture was later enhanced with an on-die, Level 2 cache, called Advanced Transfer Cache. The microarchitecture is a three-way superscalar, pipelined architecture.
coupled to the pipeline. The Level 2 cache provides 256-KByte, 512-KByte, or 1-MByte static RAM that is coupled to the core processor through a full clock-speed 64-bit cache bus. The centerpiece of the P6 processor microarchitecture is an out-of-order execution mechanism called dynamic execution.
• Advanced Dynamic Execution
  - Deep, out-of-order, speculative execution engine
    • Up to 126 instructions in flight
    • Up to 48 loads and 24 stores in pipeline
  - Enhanced branch prediction capability
    • Reduces the misprediction penalty associated with deeper pipelines
    • Advanced branch prediction algorithm
    • 4K-entry branch target array
• New cache subsystem
  - First level caches
    • Advanced Execution Trace Cache stores decoded instructions
    • Execution Trace Cache r
(Figure blocks: System Bus; Bus Unit; 3rd Level Cache, optional; 2nd Level Cache, 8-way; 1st Level Cache, 4-way; Front End with Fetch/Decode, Trace Cache, Microcode ROM, and BTBs/Branch Prediction; Out-Of-Order Core for execution; Retirement; Branch History Update. The legend distinguishes frequently used paths from less frequently used paths.)
Figure 2-2. The Intel NetBurst Microarchitecture
2.2.2.1 The Front End Pipeline
The front end supplies instructions in program order to the out-of-order execution core.
• wasted decode bandwidth due to branches or branch target in the middle of cache lines
The operation of the pipeline’s trace cache addresses these issues. Instructions are constantly being fetched and decoded by the translation engine (part of the fetch/decode logic) and built into sequences of µops called traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache.
2.2.3 Intel® Core™ Microarchitecture
Intel Core microarchitecture introduces the following features that enable high, power-efficient performance for single-threaded as well as multithreaded workloads:
• Intel® Wide Dynamic Execution enables each processor core to fetch, dispatch, and execute at high bandwidth, supporting retirement of up to four instructions per cycle.
Intel Core 2 Extreme, Intel Core 2 Duo processors and Intel Xeon processor 5100 series implement two processor cores based on the Intel Core microarchitecture. The functionality of the subsystems in each core is depicted in Figure 2-3.
Figure 2-3. The Intel Core Microarchitecture Pipeline Functionality
(Figure blocks: Instruction Fetch and PreDecode; Instruction Queue; Microcode ROM; Decode; Shared L2 Cache)
• Instruction queue provides caching of short loops to improve efficiency.
• Branch prediction unit employs dedicated hardware to handle different types of branches for improved branch prediction.
• Advanced branch prediction algorithm directs instruction fetch unit to fetch instructions likely in the architectural code path for decoding.
• Stack pointer tracker improves efficiency of executing procedure/function entries and exits.
2.2.3.
byte, word, or doubleword integers located in MMX registers. These instructions are useful in applications that operate on integer arrays and streams of integer data that lend themselves to SIMD processing.
SSE extensions were introduced in the Pentium III processor family. SSE instructions operate on packed single-precision floating-point values contained in XMM registers and on packed integers contained in MMX registers.
SIMD Extension | Register Layout | Data Type
MMX Technology | MMX Registers | 8 Packed Byte Integers; 4 Packed Word Integers; 2 Packed Doubleword Integers; Quadword
SSE | MMX Registers | 8 Packed Byte Integers; 4 Packed Word Integers; 2 Packed Doubleword Integers; Quadword
SSE | XMM Registers | 4 Packed Single-Precision Floating-Point Values
SSE2/SSE3/SSSE3 | MMX Registers | 2 Packed Doubleword Integers; Quadword
SSE2/SSE3/SSSE3 | XMM Registers | 2 Packed Double-Precision Floating-Point Values; 16 Packed Byte Integers
2.2.5 Hyper-Threading Technology
Hyper-Threading (HT) Technology was developed to improve the performance of IA-32 processors when executing multi-threaded operating system and application code or single-threaded applications under multi-tasking environments. The technology enables a single physical processor to execute two or more separate code streams (threads) concurrently using shared execution resources.
engine and the system bus interface. After power up and initialization, each logical processor can be independently directed to execute a specified thread, interrupted, or halted. HT Technology leverages the process and thread-level parallelism found in contemporary operating systems and high-performance applications by providing two or more logical processors on a single chip.
ware multi-threading support with both two processor cores and Hyper-Threading Technology. This means that the Intel Pentium processor Extreme Edition provides four logical processors in a physical package (two logical processors for each processor core). The Dual-Core Intel Xeon processor features multi-core, Hyper-Threading Technology and supports multi-processor platforms. The Intel Pentium D processor also features multi-core technology.
• 8 additional general-purpose registers (GPRs)
• 64-bit-wide GPRs and instruction pointers
• 8 additional registers for streaming SIMD extensions (SSE, SSE2, SSE3 and SSSE3)
• uniform byte-register addressing
• fast interrupt-prioritization mechanism
• a new instruction-pointer relative-addressing mode
An Intel 64 architecture processor supports existing IA-32 software because it is able to run all non-64-bit legacy modes supported by IA-32 architecture.
Table 2-1. Key Features of Most Recent IA-32 Processors
Intel Processor | Date Introduced | Microarchitecture | Top-Bin Clock Frequency at Introduction | Transistors | Register Sizes | System Bus Bandwidth | Max. Extern. Addr. Space | On-Die Caches
Intel Pentium M Processor 755 | 2004 | Intel Pentium M Processor | 2.00 GHz | 140 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.
Table 2-2. Key Features of Most Recent Intel 64 Processors (Contd.)
Intel Processor | Date Introduced | Microarchitecture | Top-Bin Clock Frequency at Introduction | Transistors | Register Sizes | System Bus Bandwidth | Max. Extern. Addr. Space | On-Die Caches
Dual-Core Intel Xeon Processor 7041 | 2005 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture; Dual-core | 3.00 GHz | 321M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 6.
Table 2-3. Key Features of Previous Generations of IA-32 Processors
Intel Processor | Date Introduced | Max. Clock Frequency/Technology at Introduction | Transistors | Register Sizes | Ext. Data Bus Size | Max. Extern. Addr. Space | Caches
8086 | 1978 | 8 MHz | 29 K | 16 GP | 16 | 1 MB | None
Intel 286 | 1982 | 12.5 MHz | 134 K | 16 GP | 16 | 16 MB | Note 3
Intel386 DX Processor | 1985 | 20 MHz | 275 K | 32 GP | 32 | 4 GB | Note 3
Intel486 DX Processor | 1989 | 25 MHz | 1.
CHAPTER 3
BASIC EXECUTION ENVIRONMENT
This chapter describes the basic execution environment of an Intel 64 or IA-32 processor as seen by assembly-language programmers. It describes how the processor executes instructions and how it stores and manipulates data. The execution environment described here includes memory (the address space), general-purpose data registers, segment registers, the flag register, and the instruction pointer register.
3.
3.1.1 Intel® 64 Architecture
Intel 64 architecture adds IA-32e mode. IA-32e mode has two sub-modes. These are:
• Compatibility mode (sub-mode of IA-32e mode): Compatibility mode permits most legacy 16-bit and 32-bit applications to run without re-compilation under a 64-bit operating system. For brevity, the compatibility sub-mode is referred to as compatibility mode in IA-32 architecture. The execution environment of compatibility mode is the same as described in Section 3.
3.2 OVERVIEW OF THE BASIC EXECUTION ENVIRONMENT
Any program or task running on an IA-32 processor is given a set of resources for executing instructions and for storing code, data, and state information. These resources (described briefly in the following paragraphs and shown in Figure 3-1) make up the basic execution environment for an IA-32 processor.
Basic Program Execution Registers
  General-Purpose Registers: eight 32-bit registers
  Segment Registers: six 16-bit registers
  EFLAGS Register: 32 bits
  EIP (Instruction Pointer Register): 32 bits
Address Space*: 0 to 2^32 − 1
FPU Registers
  Floating-Point Data Registers: eight 80-bit registers
  Control Register: 16 bits
  Status Register: 16 bits
  Tag Register: 16 bits
*The address space can be flat or segmented.
• Stack: To support procedure or subroutine calls and the passing of parameters between procedures or subroutines, a stack and stack management resources are included in the execution environment. The stack (not shown in Figure 3-1) is located in memory. See Section 6.2, “Stacks,” for more information about stack structure.
following chapters in this volume for descriptions of the other program execution resources shown in Figure 3-1:
• x87 FPU registers: See Chapter 8, “Programming with the x87 FPU.”
• XMM registers: See Chapter 10, “Programming with Streaming SIMD Extensions (SSE),” Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2),” and Chapter 12, “Programming with SSE3 and Supplemental SSE3.
hold a full 64-bit base address. The local descriptor table register (LDTR) and the task register (TR) also expand to hold a full 64-bit base address.
3.3 MEMORY ORGANIZATION
The memory that the processor addresses on its bus is called physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte is assigned a unique address, called a physical address. The physical address space ranges from zero to a maximum of 2^36 − 1 (64 GBytes) if the processor does not support Intel 64 architecture. Intel 64 architecture introduces changes in physical and linear address space; these are described in Section 3.3.
segment prevents the stack from growing into the code or data space and overwriting instructions or data, respectively.
• Real-address mode memory model: This is the memory model for the Intel 8086 processor. It is supported to provide compatibility with existing programs written to run on the Intel 8086 processor.
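The following Python sketch shows how a real-address mode segment:offset pair can be turned into a linear address, assuming the classic 8086 rule in which the 16-bit segment value is shifted left by four bits and added to the 16-bit offset. The function name and the `wrap_20_bits` switch are hypothetical; the switch models the 20-bit address bus of the 8086 and is an assumption of this example, not a statement about every later processor.

```python
def real_mode_linear(segment: int, offset: int, wrap_20_bits: bool = True) -> int:
    """Compute a real-address mode linear address from segment:offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    linear = (segment << 4) + offset  # segment * 16 + offset
    # With a 20-bit address bus, addresses above 0xFFFFF wrap around to zero.
    return linear & 0xFFFFF if wrap_20_bits else linear


print(hex(real_mode_linear(0xF000, 0xFFF0)))
print(hex(real_mode_linear(0xFFFF, 0x0010)))
print(hex(real_mode_linear(0xFFFF, 0x0010, wrap_20_bits=False)))
```

Because the segment value is only shifted by four bits, many different segment:offset pairs name the same byte, which is one reason existing 8086 programs depend on this model being reproduced exactly.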
3.3.2 Paging and Virtual Memory
With the flat or the segmented memory model, linear address space is mapped into the processor’s physical address space either directly or through paging. When using direct mapping (paging disabled), each linear address has a one-to-one correspondence with a physical address. Linear addresses are sent out on the processor’s address lines without translation.
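The difference between direct mapping and paging can be sketched with a toy model in Python. It assumes 4-KByte pages and a single flat lookup table held in a dictionary, which is a deliberate simplification of the processor's real multi-level paging structures. The names `translate` and `page_table` are hypothetical.

```python
PAGE_SHIFT = 12                    # 4-KByte pages: the low 12 bits are the byte offset
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(linear: int, page_table=None) -> int:
    """Map a linear address to a physical address in a toy model.

    page_table -- None models paging disabled (direct mapping); otherwise a
                  dict from linear page number to physical page frame number.
    """
    if page_table is None:
        return linear                      # one-to-one correspondence
    page = linear >> PAGE_SHIFT
    if page not in page_table:
        raise LookupError(f"page {page:#x} is not mapped")
    return (page_table[page] << PAGE_SHIFT) | (linear & PAGE_MASK)


print(hex(translate(0x12345678)))              # paging disabled
print(hex(translate(0x00001ABC, {0x1: 0x80}))) # linear page 1 mapped to frame 0x80
```

In this model the offset within the page is never changed by translation; only the page number is replaced.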
• Real-address mode: When in real-address mode, the processor only supports the real-address mode memory model.
• System management mode: When in SMM, the processor switches to a separate address space, called the system management RAM (SMRAM). The memory model used to address bytes in this address space is similar to the real-address mode model.
3.3.6 Extended Physical Addressing in Protected Mode
Beginning with P6 family processors, the IA-32 architecture supports addressing of up to 64 GBytes (2^36 bytes) of physical memory. A program or task cannot address locations in this address space directly. Instead, it addresses individual linear address spaces of up to 4 GBytes that are mapped to the 64-GByte physical address space through a virtual memory management mechanism.
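The sizes quoted above follow directly from the address widths. The short Python sketch below works through the arithmetic, taking one GByte as 2^30 bytes (the convention implied by "64 GBytes (2^36 bytes)"). The helper name is hypothetical.

```python
GBYTE = 2 ** 30  # bytes per GByte, as implied by 2^36 bytes = 64 GBytes


def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


physical = address_space_bytes(36)  # extended physical addressing
linear = address_space_bytes(32)    # one linear address space

print(physical // GBYTE)   # size of the physical address space in GBytes
print(linear // GBYTE)     # size of one linear address space in GBytes
print(physical // linear)  # how many 4-GByte linear spaces fit side by side
```

The last figure illustrates why a mapping mechanism is needed: a single 32-bit linear address can select only a fraction of the 36-bit physical space at any one time.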
size of the current mode (64-bit mode or compatibility mode), as overridden by any address-size prefix. The result is then zero-extended to the full 64-bit address width. Because of this, 16-bit and 32-bit applications running in compatibility mode can access only the low 4 GBytes of the 64-bit mode effective addresses. Likewise, a 32-bit address generated in 64-bit mode can access only the low 4 GBytes of the 64-bit mode effective addresses.
3.3.7.
• EFLAGS (program status and control) register. The EFLAGS register reports on the status of the program being executed and allows limited (application-program level) control of the processor.
• EIP (instruction pointer) register. The EIP register contains a 32-bit pointer to the next instruction to be executed.
3.4.
General-Purpose Registers (bits 31 to 0): EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP
Segment Registers (bits 15 to 0): CS, DS, SS, ES, FS, GS
Program Status and Control Register (bits 31 to 0): EFLAGS
Instruction Pointer (bits 31 to 0): EIP
Figure 3-4. General System and Application Programming Registers
The special uses of general-purpose registers by instructions are described in Chapter 5, “Instruction Set Summary,” in this volume.
• ESP: Stack pointer (in the SS segment)
• EBP: Pointer to data on the stack (in the SS segment)
As shown in Figure 3-5, the lower 16 bits of the general-purpose registers map directly to the register set found in the 8086 and Intel 286 processors and can be referenced with the names AX, BX, CX, DX, BP, SI, DI, and SP.
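The overlap between a 32-bit register and its 16-bit name can be modeled with simple masking. The Python sketch below treats a register as a plain integer; the function names are hypothetical. It also assumes the 32-bit execution environment, where writing the 16-bit name leaves the upper 16 bits of the 32-bit register unchanged.

```python
def low_word(reg32: int) -> int:
    """Value seen through the 16-bit name (for example, AX for EAX)."""
    return reg32 & 0xFFFF


def write_low_word(reg32: int, value16: int) -> int:
    """Model a write through the 16-bit name: the upper 16 bits are kept."""
    return (reg32 & 0xFFFF0000) | (value16 & 0xFFFF)


eax = 0x12345678
print(hex(low_word(eax)))                # value of AX
print(hex(write_low_word(eax, 0xABCD)))  # EAX after writing 0xABCD to AX
```

Modeling the registers this way makes it clear that AX is not a separate storage location but a view onto part of EAX.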
Table 3-2. Addressable General Purpose Registers
Register Type | Without REX | With REX
Byte Registers | AL, BL, CL, DL, AH, BH, CH, DH | AL, BL, CL, DL, DIL, SIL, BPL, SPL, R8L - R15L
Word Registers | AX, BX, CX, DX, DI, SI, BP, SP | AX, BX, CX, DX, DI, SI, BP, SP, R8W - R15W
Doubleword Registers | EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP | EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP, R8D - R15D
Quadword Registers | N.A.
When writing application code, programmers generally create segment selectors with assembler directives and symbols. The assembler and other tools then create the actual segment selector values associated with these directives and symbols. If writing system code, programmers may need to create segment selectors directly. See Chapter 3, “Protected-Mode Memory Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
(Figure blocks: segment registers CS, DS, SS, ES, FS, and GS pointing to a code segment, a data segment, a stack segment, and three further data segments. Note in figure: All segments are mapped to the same linear-address space.)
Figure 3-7. Use of Segment Registers in Segmented Memory Model
Each of the segment registers is associated with one of three types of storage: code, data, or stack. For example, the CS register contains the segment selector for the code segment, where the instructions being executed are stored.
See Section 3.3, “Memory Organization,” for an overview of how the segment registers are used in real-address mode.
The four segment registers CS, DS, SS, and ES are the same as the segment registers found in the Intel 8086 and Intel 286 processors. The FS and GS registers were introduced into the IA-32 architecture with the Intel386™ family of processors.
3.4.2.
BASIC EXECUTION ENVIRONMENT an interrupt or exception is handled with a task switch, the state of the EFLAGS register is saved in the TSS for the task being suspended.
This flag indicates an overflow condition for unsigned-integer arithmetic. It is also used in multiple-precision arithmetic.
PF (bit 2) Parity flag — Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.
AF (bit 4) Adjust flag — Set if an arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD) arithmetic.
ZF (bit 6)
SF (bit 7)
OF (bit 11)
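The flag definitions above can be illustrated with a small sketch that adds two 8-bit values and derives three status flags. The function name is hypothetical, and the unsigned-overflow flag described first is taken here to be the carry flag (CF). This models only byte-sized addition, not the processor's full behavior.

```python
def add8_flags(a: int, b: int) -> dict:
    """Add two 8-bit values and derive CF, PF, and AF for the result."""
    raw = (a & 0xFF) + (b & 0xFF)
    result = raw & 0xFF
    return {
        "result": result,
        # CF: carry out of the most-significant bit (unsigned overflow).
        "CF": int(raw > 0xFF),
        # PF: set when the least-significant byte has an even number of 1 bits.
        "PF": int(bin(result).count("1") % 2 == 0),
        # AF: carry out of bit 3, the low nibble used in BCD arithmetic.
        "AF": int((a & 0x0F) + (b & 0x0F) > 0x0F),
    }
```

Adding 0FH and 01H, for instance, carries out of bit 3 and therefore sets AF without setting CF.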
BASIC EXECUTION ENVIRONMENT 3.4.3.3 System Flags and IOPL Field The system flags and IOPL field in the EFLAGS register control operating-system or executive operations. They should not be modified by application programs. The functions of the system flags are as follows: TF (bit 8) Trap flag — Set to enable single-step mode for debugging; clear to disable single-step mode. IF (bit 9) Interrupt enable flag — Controls the response of the processor to maskable interrupt requests.
3.4.3.4 RFLAGS Register in 64-Bit Mode
In 64-bit mode, EFLAGS is extended to 64 bits and called RFLAGS. The upper 32 bits of the RFLAGS register are reserved. The lower 32 bits of RFLAGS are the same as EFLAGS.
3.5 INSTRUCTION POINTER
The instruction pointer (EIP) register contains the offset in the current code segment for the next instruction to be executed.
The operand-size attribute selects the size of operands. When the 16-bit operand-size attribute is in force, operands can generally be either 8 bits or 16 bits, and when the 32-bit operand-size attribute is in force, operands can generally be 8 bits or 32 bits. The address-size attribute selects the sizes of addresses used to address memory: 16 bits or 32 bits. When the 16-bit address-size attribute is in force, segment offsets and displacements are 16 bits.
BASIC EXECUTION ENVIRONMENT operand-size 66H prefix to toggle to a 16-bit operand size. However, setting REX.W takes precedence over the operand-size prefix (66H) when both are used. In the case of SSE/SSE2/SSE3/SSSE3 SIMD instructions: the 66H, F2H, and F3H prefixes are mandatory for opcode extensions. In such a case, there is no interaction between a valid REX.W prefix and a 66H opcode extension prefix.
example, the following ADD instruction adds an immediate value of 14 to the contents of the EAX register:
ADD EAX, 14
All arithmetic instructions (except the DIV and IDIV instructions) allow the source operand to be an immediate value. The maximum value allowed for an immediate operand varies among instructions, but can never be greater than the maximum value of an unsigned doubleword integer (2^32 − 1). 3.7.
BASIC EXECUTION ENVIRONMENT 3.7.2.
[Figure 3-10. Memory Operand Address in 64-Bit Mode: a segment selector (bits 15 through 0) and an offset, or linear address (bits 63 through 0)]
3.7.4 Specifying a Segment Selector
The segment selector can be specified either implicitly or explicitly. The most common method of specifying a segment selector is to load it in a segment register and then allow the processor to select the register implicitly, depending on the type of operation being performed.
At the machine level, a segment override is specified with a segment-override prefix, which is a byte placed at the beginning of an instruction. The following default segment selections cannot be overridden:
• Instruction fetches must be made from the code segment.
• Destination strings in string instructions must be stored in the data segment pointed to by the ES register.
• Push and pop operations must always reference the SS segment.
the possible ways that these components can be combined to create an effective address in the selected segment.
Offset = Base + (Index * Scale) + Displacement
Base: EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI
Index: EAX, EBX, ECX, EDX, EBP, ESI, EDI
Scale: 1, 2, 4, 8
Displacement: None, 8-bit, 16-bit, 32-bit
Figure 3-11.
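The offset formula and the allowed component values can be sketched as follows. This is an illustrative model only: the function name and the dictionary of register values are hypothetical, and the result is simply truncated to 32 bits.

```python
BASE_REGISTERS = {"EAX", "EBX", "ECX", "EDX", "ESP", "EBP", "ESI", "EDI"}
INDEX_REGISTERS = BASE_REGISTERS - {"ESP"}  # ESP is not listed as an index
SCALE_FACTORS = {1, 2, 4, 8}


def effective_address(regs, base=None, index=None, scale=1, disp=0):
    """Compute Offset = Base + (Index * Scale) + Displacement, modulo 2**32.

    `regs` maps register names to their current integer values.
    """
    if base is not None and base not in BASE_REGISTERS:
        raise ValueError(f"invalid base register: {base}")
    if index is not None and index not in INDEX_REGISTERS:
        raise ValueError(f"invalid index register: {index}")
    if scale not in SCALE_FACTORS:
        raise ValueError(f"invalid scale factor: {scale}")
    offset = disp
    if base is not None:
        offset += regs[base]
    if index is not None:
        offset += regs[index] * scale
    return offset & 0xFFFFFFFF
```

With EBX = 1000H and ESI = 3, the combination base EBX, index ESI, scale 4, displacement 8 yields the offset 1014H.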
BASIC EXECUTION ENVIRONMENT An important special case of this combination is access to parameters in a procedure activation record. A procedure activation record is the stack frame created when a procedure is entered. Here, the EBP register is the best choice for the base register, because it automatically selects the stack segment. This is a compact encoding for this common function.
combination of these components based on the language construct a programmer defines.
3.7.7 I/O Port Addressing
The processor supports an I/O address space that contains up to 65,536 8-bit I/O ports. Ports that are 16-bit and 32-bit may also be defined in the I/O address space. An I/O port can be addressed with either an immediate operand or a value in the DX register. See Chapter 13, “Input/Output,” for more information about I/O port addressing.
CHAPTER 4 DATA TYPES This chapter introduces data types defined for the Intel 64 and IA-32 architectures. A section at the end of this chapter describes the real-number and floating-point concepts used in x87 FPU, SSE, SSE2, SSE3 and SSSE3 extensions. 4.1 FUNDAMENTAL DATA TYPES The fundamental data types are bytes, words, doublewords, quadwords, and double quadwords (see Figure 4-1).
DATA TYPES Figure 4-2 shows the byte order of each of the fundamental data types when referenced as operands in memory. The low byte (bits 0 through 7) of each data type occupies the lowest address in memory and that address is also the address of the operand.
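The byte ordering described above (low byte at the lowest address) can be illustrated with a short sketch. The helper names are hypothetical; a Python bytes object stands in for consecutive memory locations, with index 0 as the operand's address.

```python
def to_memory(value: int, size: int) -> bytes:
    """Lay out an unsigned integer as `size` bytes, low byte first."""
    return value.to_bytes(size, byteorder="little")


def from_memory(data: bytes) -> int:
    """Read an unsigned integer back from its in-memory byte sequence."""
    return int.from_bytes(data, byteorder="little")
```

A doubleword holding 12345678H is therefore stored as the bytes 78H, 56H, 34H, 12H at increasing addresses.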
DATA TYPES Some instructions that operate on double quadwords require memory operands to be aligned on a natural boundary. These instructions generate a general-protection exception (#GP) if an unaligned operand is specified. A natural boundary for a double quadword is any address evenly divisible by 16. Other instructions that operate on double quadwords permit unaligned access (without generating a general-protection exception).
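The natural-boundary rule can be expressed as a divisibility test. In the sketch below the function name and the size table are hypothetical. Only the double quadword entry (16 bytes) comes from the text above; the other entries assume the same rule, that is, alignment on a multiple of the operand size in bytes.

```python
# Operand sizes in bytes; entries other than "double quadword" are an
# assumption that the same size-based rule applies.
OPERAND_SIZES = {
    "word": 2,
    "doubleword": 4,
    "quadword": 8,
    "double quadword": 16,
}


def is_naturally_aligned(address: int, operand: str) -> bool:
    """Return True if `address` is evenly divisible by the operand size."""
    return address % OPERAND_SIZES[operand] == 0
```

An instruction that requires alignment would raise #GP for an address where this check fails; an instruction that permits unaligned access would not.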
[Figure 4-3. Numeric Data Types: byte, word, doubleword, and quadword unsigned integers (bits 7-0, 15-0, 31-0, and 63-0); byte, word, doubleword, and quadword signed integers with the sign in bit 7, 15, 31, and 63 respectively; and three floating-point layouts with the sign in bit 31, 63, and 79]
DATA TYPES 4.2.1 Integers The Intel 64 and IA-32 architectures define two types of integers: unsigned and signed. Unsigned integers are ordinary binary values ranging from 0 to the maximum positive number that can be encoded in the selected operand size. Signed integers are two’s complement binary values that can be used to represent both positive and negative integer values.
Table 4-1. Signed Integer Encodings
Class                        Two’s Complement Encoding
                             Sign    Remaining bits
Positive    Largest          0       11..11
            .                .       .
            Smallest         0       00..01
Zero                         0       00..00
Negative    Smallest         1       11..11
            .                .       .
            Largest          1       00..00
Integer indefinite           1       00..00
Signed Byte Integer:         ← 7 bits →
Signed Word Integer:         ← 15 bits →
Signed Doubleword Integer:   ← 31 bits →
Signed Quadword Integer:     ← 63 bits →
The sign bit is set for negative integers and cleared for positive integers and zero.
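The encodings in Table 4-1 can be reproduced with a minimal sketch of two's complement conversion. The function names are hypothetical; `bits` is the operand size (8, 16, 32, or 64).

```python
def encode_signed(value: int, bits: int) -> int:
    """Return the two's complement bit pattern of `value` in `bits` bits."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)


def decode_signed(raw: int, bits: int) -> int:
    """Interpret a `bits`-wide pattern as a two's complement integer."""
    sign_bit = 1 << (bits - 1)
    # The sign bit carries a negative weight; the remaining bits are positive.
    return (raw & (sign_bit - 1)) - (raw & sign_bit)
```

For a signed byte, the smallest-magnitude negative value (−1) encodes as FFH and the largest-magnitude negative value (−128) as 80H, matching the sign and remaining-bit columns of the table.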
Table 4-2. Length, Precision, and Range of Floating-Point Data Types
Data Type                   Length   Precision (Bits)   Approximate Normalized Range
                                                        Binary                Decimal
Single Precision            32       24                 2^–126 to 2^127       1.18 × 10^–38 to 3.40 × 10^38
Double Precision            64       53                 2^–1022 to 2^1023     2.23 × 10^–308 to 1.79 × 10^308
Double Extended Precision   80       64                 2^–16382 to 2^16383   3.37 × 10^–4932 to 1.18 × 10^4932
NOTE Section 4.
DATA TYPES Table 4-3. Floating-Point Number and NaN Encodings Class Sign Biased Exponent Significand Integer Positive Negative NaNs 1 Fraction +∞ 0 11..11 1 00..00 +Normals 0 . . 0 11..10 . . 00..01 1 . . 1 11..11 . . 00..00 +Denormals 0 . . 0 00..00 . . 00..00 0 . . 0 11.11 . . 00..01 +Zero 0 00..00 0 00..00 −Zero 1 00..00 0 00..00 −Denormals 1 . . 1 00..00 . . 00..00 0 . . 0 00..01 . . 11..11 −Normals 1 . . 1 00..01 . . 11..10 1 . . 1 00..00 . . 11..
DATA TYPES When storing floating-point values in memory, single-precision values are stored in 4 consecutive bytes in memory; double-precision values are stored in 8 consecutive bytes; and double extended-precision values are stored in 10 consecutive bytes. The single-precision and double-precision floating-point data types are operated on by x87 FPU, and SSE/SSE2/SSE3 instructions. The double-extended-precision floating-point format is only operated on by the x87 FPU. See Section 11.6.
[Figure 4-5. Pointers in 64-Bit Mode: a near pointer consists of an offset; a far pointer consists of a segment selector and an offset, shown for two operand sizes]
4.4 BIT FIELD DATA TYPE
A bit field (see Figure 4-6) is a contiguous sequence of bits. It can begin at any bit position of any byte in memory and can contain up to 32 bits.
[Figure 4-6: a bit field, with its field length and least significant bit marked]
4.6 PACKED SIMD DATA TYPES
Intel 64 and IA-32 architectures define and operate on a set of 64-bit and 128-bit packed data types for use in SIMD operations. These data types consist of fundamental data types (packed bytes, words, doublewords, and quadwords) and numeric interpretations of fundamental types for use in packed integer and packed floating-point operations. 4.6.
DATA TYPES 4.6.2 128-Bit Packed SIMD Data Types The 128-bit packed SIMD data types were introduced into the IA-32 architecture in the SSE extensions and used with SSE2, SSE3 and SSSE3 extensions. They are operated on primarily in the 128-bit XMM registers and memory. The fundamental 128-bit packed data types are packed bytes, packed words, packed doublewords, and packed quadwords (see Figure 4-8).
DATA TYPES 4.7 BCD AND PACKED BCD INTEGERS Binary-coded decimal integers (BCD integers) are unsigned 4-bit integers with valid values ranging from 0 to 9. IA-32 architecture defines operations on BCD integers located in one or more general-purpose registers or in one or more x87 FPU registers (see Figure 4-9).
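A BCD digit occupies four bits and must lie in the range 0 to 9. The sketch below packs and unpacks two such digits in one byte. The function names are hypothetical, and the layout (more significant digit in the upper four bits) is stated here as an assumption, since the excerpt above does not spell it out.

```python
def pack_bcd_byte(value: int) -> int:
    """Pack a decimal value 0..99 into one byte, one BCD digit per nibble.

    Assumption: the tens digit goes in bits 7-4, the ones digit in bits 3-0.
    """
    if not 0 <= value <= 99:
        raise ValueError("a packed BCD byte holds exactly two decimal digits")
    tens, ones = divmod(value, 10)
    return (tens << 4) | ones


def unpack_bcd_byte(raw: int) -> int:
    """Recover the decimal value from a packed BCD byte, rejecting digits > 9."""
    high, low = (raw >> 4) & 0x0F, raw & 0x0F
    if high > 9 or low > 9:
        raise ValueError(f"invalid BCD digit in {raw:#04x}")
    return high * 10 + low
```

The decimal value 42 packs to 42H, which is why packed BCD bytes read naturally in hexadecimal.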
DATA TYPES Table 4-4. Packed Decimal Integer Encodings Magnitude Class Sign Positive Largest 0 0000000 . . . . . . Smallest 0 0000000 0000 0000 Zero 0 0000000 0000 Negative Zero 1 0000000 Smallest 1 0000000 . . . . . . Largest 1 0000000 1001 1001 Packed BCD Integer Indefinit e 1 1111111 1111 1111 ← 1 byte → digit digit digit digit ... digit 1001 1001 1001 1001 ... 1001 0000 0000 ... 0001 0000 0000 0000 ... 0000 0000 0000 0000 0000 ...
DATA TYPES 4.8.1 Real Number System As shown in Figure 4-10, the real-number system comprises the continuum of real numbers from minus infinity (− ∞) to plus infinity (+ ∞). Because the size and number of registers that any computer can have is limited, only a subset of the real-number continuum can be used in real-number (floating-point) calculations.
[Figure 4-10. Binary Real Number System: the binary real number line compared with the subset of binary real numbers that can be represented with the IEEE single-precision (32-bit) floating-point format. Precision is limited to 24 binary digits, and numbers that fall between adjacent representable values cannot be represented.]
[Figure 4-11. Binary Floating-Point Format: sign, exponent, and significand; the significand consists of an integer (or J-bit) and a fraction]
Table 4-5. Real and Floating-Point Number Notation
Notation                              Value
Ordinary Decimal                      178.125
Scientific Decimal                    1.78125 × 10^2
Scientific Binary                     1.0110010001 × 2^111 (exponent in binary)
Scientific Binary (Biased Exponent)   1.0110010001 × 2^10000110 (exponent in binary)
IEEE Single-Precision Format          Sign: 0
                                      Biased Exponent: 10000110
                                      Normalized Significand: 1. (Implied) 01100100010000000000000
4.8.2.1 Normalized Numbers
In most cases, floating-point numbers are encoded in normalized form.
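The single-precision row of Table 4-5 can be checked with a short sketch that splits the 32-bit encoding into its three fields. The function name is hypothetical; the sketch relies on Python's standard `struct` module to produce the IEEE single-precision encoding.

```python
import struct


def single_fields(x: float) -> tuple:
    """Return (sign, biased exponent, fraction) of x as a single-precision value."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31
    biased_exponent = (raw >> 23) & 0xFF  # 8 exponent bits
    fraction = raw & 0x7FFFFF             # 23 stored bits; the leading 1 is implied
    return sign, biased_exponent, fraction
```

For 178.125 the fields are sign 0, biased exponent 10000110B (134, which is 7 plus the bias of 127), and the fraction bits 0110010001 followed by thirteen zeros.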
4.8.3 Real Number and Non-number Encodings
A variety of real numbers and special values can be encoded in the IEEE Standard 754 floating-point format. These numbers and values are generally divided into the following classes:
• Signed zeros
• Denormalized finite numbers
• Normalized finite numbers
• Signed infinities
• NaNs
• Indefinite numbers
(The term NaN stands for “Not a Number.”) Figure 4-12 shows how the encodings for these numbers and non-numbers fit into the real number continuum.
DATA TYPES An IA-32 processor can operate on and/or return any of these values, depending on the type of computation being performed. The following sections describe these number and non-number classes. 4.8.3.1 Signed Zeros Zero can be represented as a +0 or a −0 depending on the sign bit. Both encodings are equal in value. The sign of a zero result depends on the operation being performed and the rounding mode being used. Signed zeros have been provided to aid in implementing interval arithmetic.
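The two zero encodings can be observed directly. The sketch below is an illustration in Python rather than processor code; the function name is hypothetical, and the single-precision encoding comes from the standard `struct` module.

```python
import math
import struct


def single_bits(x: float) -> int:
    """Return the 32-bit single-precision encoding of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


positive_zero = single_bits(0.0)    # all bits clear
negative_zero = single_bits(-0.0)   # only the sign bit set
zeros_compare_equal = (0.0 == -0.0)
sign_of_negative_zero = math.copysign(1.0, -0.0)
```

The two encodings differ only in the sign bit, yet the values compare as equal; the sign remains observable through operations such as `math.copysign`.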
Table 4-6. Denormalization Process
Operation          Sign   Exponent*   Significand
True Result        0      −129        1.01011100000...00
Denormalize        0      −128        0.10101110000...00
Denormalize        0      −127        0.01010111000...00
Denormalize        0      −126        0.00101011100...00
Denormal Result    0      −126        0.00101011100...00
* Expressed as an unbiased, decimal number.
In the extreme case, all the significant bits are shifted out to the right by leading zeros, creating a zero result.
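The steps of Table 4-6 can be traced with a small sketch that shifts the significand right one bit at a time until the exponent reaches the minimum for the format. The function name is hypothetical. Exact fractions are used, so the sketch does not model the loss of low-order bits that a fixed-width significand would suffer.

```python
from fractions import Fraction


def denormalize(significand: Fraction, exponent: int, min_exponent: int = -126):
    """Shift the significand right until the exponent reaches `min_exponent`.

    Returns the final significand, the final exponent, and the list of
    intermediate (exponent, significand) steps.
    """
    steps = []
    while exponent < min_exponent:
        significand /= 2   # one-bit right shift inserts a leading zero
        exponent += 1
        steps.append((exponent, significand))
    return significand, exponent, steps
```

Starting from the true result 1.010111B × 2^−129 (87/64 as a fraction), three shifts give 0.001010111B × 2^−126 (87/512), matching the denormal result in the table.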
DATA TYPES above the ends of the real number line. This space includes any value with the maximum allowable biased exponent and a non-zero fraction (the sign bit is ignored for NaNs). The IA-32 architecture defines two classes of NaNs: quiet NaNs (QNaNs) and signaling NaNs (SNaNs). A QNaN is a NaN with the most significant fraction bit set; an SNaN is a NaN with the most significant fraction bit clear. QNaNs are allowed to propagate through most arithmetic operations without signaling an exception.
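The NaN definitions above translate into a simple bit test. The sketch below classifies a 32-bit single-precision encoding; the function name is hypothetical, and the field positions (8-bit exponent in bits 30-23, 23-bit fraction in bits 22-0) are those of the single-precision format.

```python
def classify_single(raw: int) -> str:
    """Classify a 32-bit single-precision encoding as QNaN, SNaN, or neither."""
    biased_exponent = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF
    # A NaN has the maximum biased exponent and a non-zero fraction;
    # the sign bit is ignored.
    if biased_exponent != 0xFF or fraction == 0:
        return "not a NaN"
    # The most significant fraction bit distinguishes quiet from signaling.
    return "QNaN" if fraction & 0x400000 else "SNaN"
```

The encoding 7F800000H has a zero fraction and is therefore an infinity rather than a NaN.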
DATA TYPES Table 4-7. Rules for Handling NaNs Source Operands SNaN and QNaN Result1 x87 FPU — QNaN source operand.
operand address field of the exception pointer will point to the NaN, and the NaN will contain the index number of the array element. Quiet NaNs are often used to speed up debugging. In its early testing phase, a program often contains multiple errors. An exception handler can be written to save diagnostic information in memory whenever it is invoked.
DATA TYPES The processor then sets the result to b or to c according to the selected rounding mode. Rounding introduces an error in a result that is less than one unit in the last place (the least significant bit position of the floating-point value) to which the result is rounded. The IEEE Standard 754 defines four rounding modes (see Table 4-8): round to nearest, round up, round down, and round toward zero. The default rounding mode (for the Intel 64 and IA-32 architectures) is round to nearest.
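The choice between the two nearest representable values can be sketched for each of the four rounding modes. The example below is a simplified model, not the processor's algorithm: it rounds an exact fraction to a fixed number of binary fraction bits, and the function and mode names are hypothetical. It assumes that ties in round-to-nearest go to the value whose last bit is zero (the even value).

```python
import math
from fractions import Fraction


def round_to_bits(x: Fraction, frac_bits: int, mode: str) -> Fraction:
    """Round x to a multiple of 2**-frac_bits under the given rounding mode."""
    scaled = x * (1 << frac_bits)
    lower, upper = math.floor(scaled), math.ceil(scaled)  # the candidates b and c
    if mode == "down":        # toward minus infinity
        chosen = lower
    elif mode == "up":        # toward plus infinity
        chosen = upper
    elif mode == "zero":      # truncate
        chosen = lower if scaled >= 0 else upper
    elif mode == "nearest":
        if scaled - lower < upper - scaled:
            chosen = lower
        elif scaled - lower > upper - scaled:
            chosen = upper
        else:
            # Tie (or exact value): assumption that the even candidate wins.
            chosen = lower if lower % 2 == 0 else upper
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return Fraction(chosen, 1 << frac_bits)
```

In every mode the result differs from the exact value by less than one unit in the last retained bit position.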
• The MXCSR register (bits 13 and 14)
Although these two RC fields perform the same function, they control rounding for different execution environments within the processor. The RC field in the x87 FPU control register controls rounding for computations performed with the x87 FPU instructions; the RC field in the MXCSR register controls rounding for SIMD floating-point computations performed with the SSE/SSE2 instructions. 4.8.4.
DATA TYPES occurs). The numeric-underflow, numeric-overflow and precision exceptions are post-computation exceptions. Each of the six exception classes has a corresponding flag bit (IE, ZE, OE, UE, DE, or PE) and mask bit (IM, ZM, OM, UM, DM, or PM). When one or more floating-point exception conditions are detected, the processor sets the appropriate flag bits, then takes one of two possible courses of action, depending on the settings of the corresponding mask bits: • Mask bit set.
DATA TYPES 4.9.1 Floating-Point Exception Conditions The following sections describe the various conditions that cause a floating-point exception to be generated and the masked response of the processor when these conditions are detected. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A & 3B, list the floating-point exceptions that can be signaled for each floating-point instruction. 4.9.1.
See the following sections for information regarding the denormal-operand exception when detected while executing x87 FPU or SSE/SSE2/SSE3 instructions:
• x87 FPU; Section 8.5.2, “Denormal Operand Exception (#D)”
• SIMD floating-point exceptions; Section 11.5.2.2, “Denormal-Operand Exception (#D)”
4.9.1.3 Divide-By-Zero Exception (#Z)
The processor reports the floating-point divide-by-zero exception whenever an instruction attempts to divide a finite non-zero operand by 0.
Table 4-9. Numeric Overflow Thresholds
Floating-Point Format        Overflow Thresholds
Single Precision             | x | ≥ 1.0 ∗ 2^128
Double Precision             | x | ≥ 1.0 ∗ 2^1024
Double Extended Precision    | x | ≥ 1.0 ∗ 2^16384
When a numeric-overflow exception occurs and the exception is masked, the processor sets the OE flag and returns one of the values shown in Table 4-10, according to the current rounding mode. See Section 4.8.4, “Rounding.
numeric underflow for each of the floating-point formats (assuming normalized results); underflow occurs when a rounded result falls strictly within the threshold range. The ability to detect and handle underflow is provided to prevent a very small result from propagating through a computation and causing another exception (such as overflow during division) to be generated at a later time. Table 4-11.
DATA TYPES exception occurs frequently and indicates that some (normally acceptable) accuracy will be lost due to rounding. The exception is supported for applications that need to perform exact arithmetic only. Because the rounded result is generally satisfactory for most applications, this exception is commonly masked.
DATA TYPES the destination. Alternately, a denormal-operand or inexact-result exception can accompany a numeric underflow or overflow exception with both exceptions being handled. The precedence for floating-point exceptions is as follows: 1. Invalid-operation exception, subdivided as follows: a. stack underflow (occurs with x87 FPU only) b. stack overflow (occurs with x87 FPU only) c.
• Clearing the exception flags
• Returning to the interrupted program and resuming normal execution
In lieu of writing recovery procedures, the exception handler can do the following:
• Increment in software an exception counter for later display or printing
• Print or display diagnostic information (such as the state information)
• Halt further program execution
CHAPTER 5 INSTRUCTION SET SUMMARY
This chapter provides an abridged overview of Intel 64 and IA-32 instructions. Instructions are divided into the following groups:
• General purpose
• x87 FPU
• x87 FPU and SIMD state management
• Intel MMX technology
• SSE extensions
• SSE2 extensions
• SSE3 extensions
• SSSE3 extensions
• System instructions
• IA-32e mode: 64-bit mode instructions
• VMX instructions
Table 5-1 lists the groups and IA-32 processors that support each group.
INSTRUCTION SET SUMMARY Table 5-1. Instruction Groups and IA-32 Processors (Contd.
INSTRUCTION SET SUMMARY 5.1.1 Data Transfer Instructions The data transfer instructions move data between memory and the general-purpose and segment registers. They also perform specific operations such as conditional moves, stack access, and data conversion.
INSTRUCTION SET SUMMARY CMOVNP/CMOVPO Conditional move if not parity/Conditional move if parity odd XCHG Exchange BSWAP Byte swap XADD Exchange and add CMPXCHG Compare and exchange CMPXCHG8B Compare and exchange 8 bytes PUSH Push onto stack POP Pop off of stack PUSHA/PUSHAD Push general-purpose registers onto stack POPA/POPAD Pop general-purpose registers from stack CWD/CDQ Convert word to doubleword/Convert doubleword to quadword CBW/CWDE Convert byte to word/Convert word to doublew
5.1.3 Decimal Arithmetic Instructions
The decimal arithmetic instructions perform decimal arithmetic on binary coded decimal (BCD) data.
DAA   Decimal adjust after addition
DAS   Decimal adjust after subtraction
AAA   ASCII adjust after addition
AAS   ASCII adjust after subtraction
AAM   ASCII adjust after multiplication
AAD   ASCII adjust before division
5.1.
INSTRUCTION SET SUMMARY 5.1.6 Bit and Byte Instructions Bit instructions test and modify individual bits in word and doubleword operands. Byte instructions set the value of a byte operand to indicate the status of flags in the EFLAGS register.
INSTRUCTION SET SUMMARY 5.1.7 Control Transfer Instructions The control transfer instructions provide jump, conditional jump, loop, and call and return operations to control program flow.
INSTRUCTION SET SUMMARY INTO Interrupt on overflow BOUND Detect value out of range ENTER High-level procedure entry LEAVE High-level procedure exit 5.1.8 String Instructions The string instructions operate on strings of bytes, allowing them to be moved to and from memory.
OUTS/OUTSB   Output string to port/Output byte string to port
OUTS/OUTSW   Output string to port/Output word string to port
OUTS/OUTSD   Output string to port/Output doubleword string to port
5.1.10 Enter and Leave Instructions
These instructions provide machine-language support for procedure calls in block-structured languages.
ENTER   High-level procedure entry
LEAVE   High-level procedure exit
5.1.
INSTRUCTION SET SUMMARY 5.1.13 Miscellaneous Instructions The miscellaneous instructions provide such functions as loading an effective address, executing a “no-operation,” and retrieving processor identification information. LEA Load effective address NOP No operation UD2 Undefined instruction XLAT/XLATB Table lookup translation CPUID Processor Identification 5.2 X87 FPU INSTRUCTIONS The x87 FPU instructions are executed by the processor’s x87 FPU.
FCMOVNB    Floating-point conditional move if not below
FCMOVNBE   Floating-point conditional move if not below or equal
FCMOVU     Floating-point conditional move if unordered
FCMOVNU    Floating-point conditional move if not unordered
5.2.2 x87 FPU Basic Arithmetic Instructions
The basic arithmetic instructions perform basic arithmetic operations on floating-point and integer operands.
INSTRUCTION SET SUMMARY 5.2.3 x87 FPU Comparison Instructions The compare instructions examine or compare floating-point or integer operands.
FLDLN2   Load log_e 2
FLDL2T   Load log_2 10
FLDLG2   Load log_10 2
5.2.6 x87 FPU Control Instructions
The x87 FPU control instructions operate on the x87 FPU register stack and save and restore the x87 FPU state.
INSTRUCTION SET SUMMARY Initially, these instructions operated only on the x87 FPU (and MMX) registers to perform a fast save and restore, respectively, of the x87 FPU and MMX state. With the introduction of SSE extensions in the Pentium III processor family, these instructions were expanded to also save and restore the state of the XMM and MXCSR registers. Intel 64 architecture also supports these instructions. See Section 10.5, “FXSAVE and FXRSTOR Instructions,” for more detail. 5.
INSTRUCTION SET SUMMARY PUNPCKHWD Unpack high-order words PUNPCKHDQ Unpack high-order doublewords PUNPCKLBW Unpack low-order bytes PUNPCKLWD Unpack low-order words PUNPCKLDQ Unpack low-order doublewords 5.4.3 MMX Packed Arithmetic Instructions The packed arithmetic instructions perform packed integer arithmetic on packed byte, word, and doubleword integers.
5.4.5 MMX Logical Instructions
The logical instructions perform AND, AND NOT, OR, and XOR operations on quadword operands.
PAND    Bitwise logical AND
PANDN   Bitwise logical AND NOT
POR     Bitwise logical OR
PXOR    Bitwise logical exclusive OR
5.4.6 MMX Shift and Rotate Instructions
The shift and rotate instructions shift and rotate packed bytes, words, doublewords, or quadwords in 64-bit operands.
SSE instructions are divided into four subgroups (note that the first subgroup has subordinate subgroups of its own):
• SIMD single-precision floating-point instructions that operate on the XMM registers
• MXCSR state management instructions
• 64-bit SIMD integer instructions that operate on the MMX registers
• Cacheability control, prefetch, and instruction ordering instructions
The following sections provide an overview of these groups. 5.5.
INSTRUCTION SET SUMMARY 5.5.1.2 SSE Packed Arithmetic Instructions SSE packed arithmetic instructions perform packed and scalar arithmetic operations on packed and scalar single-precision floating-point operands.
INSTRUCTION SET SUMMARY 5.5.1.4 SSE Logical Instructions SSE logical instructions perform bitwise AND, AND NOT, OR, and XOR operations on packed single-precision floating-point operands.
5.5.2 SSE MXCSR State Management Instructions
MXCSR state management instructions allow saving and restoring the state of the MXCSR control and status register.
LDMXCSR   Load MXCSR register
STMXCSR   Save MXCSR register state
5.5.3 SSE 64-Bit SIMD Integer Instructions
These SSE 64-bit SIMD integer instructions perform additional operations on packed bytes, words, or doublewords contained in MMX registers.
PREFETCHh   Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy
SFENCE      Serializes store operations
5.6 SSE2 INSTRUCTIONS
SSE2 extensions represent an extension of the SIMD execution model introduced with MMX technology and the SSE extensions. SSE2 instructions operate on packed double-precision floating-point operands and on packed byte, word, doubleword, and quadword operands located in the XMM registers.
MOVUPD     Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
MOVHPD     Move high packed double-precision floating-point value to and from the high quadword of an XMM register and memory
MOVLPD     Move low packed double-precision floating-point value to and from the low quadword of an XMM register and memory
MOVMSKPD   Extract sign mask from two packed double-precision floating-point values
MOVSD      Move scalar double
ANDNPD   Perform bitwise logical AND NOT of packed double-precision floating-point values
ORPD     Perform bitwise logical OR of packed double-precision floating-point values
XORPD    Perform bitwise logical XOR of packed double-precision floating-point values
5.6.1.4 SSE2 Compare Instructions
SSE2 compare instructions compare packed and scalar double-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register.
INSTRUCTION SET SUMMARY CVTPD2DQ Convert packed double-precision floating-point values to packed doubleword integers CVTTPD2DQ Convert with truncation packed double-precision floating-point values to packed doubleword integers CVTDQ2PD Convert packed doubleword integers to packed double-precision floating-point values CVTPS2PD Convert packed single-precision floating-point values to packed double-precision floating-point values CVTPD2PS Convert packed double-precision floating-point values to pack
INSTRUCTION SET SUMMARY PMULUDQ Multiply packed unsigned doubleword integers PADDQ Add packed quadword integers PSUBQ Subtract packed quadword integers PSHUFLW Shuffle packed low words PSHUFHW Shuffle packed high words PSHUFD Shuffle packed doublewords PSLLDQ Shift double quadword left logical PSRLDQ Shift double quadword right logical PUNPCKHQDQ Unpack high quadwords PUNPCKLQDQ Unpack low quadwords 5.6.
• Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions
• Two thread synchronization instructions
SSE3 instructions can only be executed on Intel 64 and IA-32 processors that support SSE3 extensions. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
INSTRUCTION SET SUMMARY element of the second operand from the first element of the second operand; and the fourth by subtracting the fourth element of the second operand from the third element of the second operand. HADDPD Performs a double-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the first and second elements of the second operand.
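The HADDPD description can be modeled with a two-element sketch. The function name is hypothetical, and each operand is represented as a tuple of two floating-point values, with element 0 as the first data element.

```python
def haddpd(first, second):
    """Model HADDPD on operands given as (element0, element1) tuples.

    Result element 0 is the sum of the two elements of the first operand;
    result element 1 is the sum of the two elements of the second operand.
    """
    return (first[0] + first[1], second[0] + second[1])
```

With a first operand of (1.0, 2.0) and a second operand of (10.0, 20.0), the result is (3.0, 30.0): each pair is summed within its own operand rather than across operands.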
5.8 SUPPLEMENTAL STREAMING SIMD EXTENSIONS 3 (SSSE3) INSTRUCTIONS
SSSE3 provides 32 instructions (represented by 14 mnemonics) to accelerate computations on packed integers. These include:
• Twelve instructions that perform horizontal addition or subtraction operations.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
tion operands. The signed, saturated 16-bit results are packed and written to the destination operand.
PHSUBD   Performs horizontal subtraction on each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least significant doubleword of each pair in the source and destination operands. The signed 32-bit results are packed and written to the destination operand.
5.8.2
5.8.6 Packed Sign
PSIGNB/W/D   Negates each signed integer element of the destination operand if the corresponding data element in the source operand is less than zero.
5.8.7 Packed Align Right
PALIGNR      Source operand is appended after the destination operand forming an intermediate value of twice the width of an operand.
5.9
LOCK (prefix)   Lock Bus
HLT             Halt processor
RSM             Return from system management mode (SMM)
RDMSR           Read model-specific register
WRMSR           Write model-specific register
RDPMC           Read performance monitoring counters
RDTSC           Read time stamp counter
SYSENTER        Fast System Call, transfers to a flat protected mode kernel at CPL = 0
SYSEXIT         Fast return from a Fast System Call, transfers to flat protected mode user code at CPL = 3
5.
INSTRUCTION SET SUMMARY the VMCS have been written to the VMCS-data area in the referenced VMCS region. VMREAD Reads a component from the VMCS (the encoding of that field is given in a register operand) and stores it into a destination operand. VMWRITE Writes a component to the VMCS (the encoding of that field is given in a register operand) from a source operand.
CHAPTER 6 PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS
This chapter describes the facilities in the Intel 64 and IA-32 architectures for executing calls to procedures or subroutines. It also describes how interrupts and exceptions are handled from the perspective of an application programmer.
6.1 PROCEDURE CALL TYPES
The processor supports procedure calls in the following two different ways:
• CALL and RET instructions.
• ENTER and LEAVE instructions, in conjunction with the CALL and RET instructions.
When a system sets up many stacks, only one stack (the current stack) is available at a time. The current stack is the one contained in the segment referenced by the SS register.
[Figure: stack structure within the stack segment, showing the bottom of the stack (initial ESP value), the local variables for the calling procedure, and the parameters passed to the called procedure. The stack can be 16 or 32 bits wide. The EBP register is typically set to point to the return instruction pointer.]
3. Load the stack pointer for the stack into the ESP register using a MOV, POP, or LSS instruction. The LSS instruction can be used to load the SS and ESP registers in one operation.
See “Segment Descriptors” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for information on how to set up a segment descriptor and segment limits for a stack segment. 6.2.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS 6.2.4 Procedure Linking Information The processor provides two pointers for linking of procedures: the stack-frame base pointer and the return instruction pointer. When used in conjunction with a standard software procedure-call technique, these pointers permit reliable and coherent linking of procedures. 6.2.4.1 Stack-Frame Base Pointer The stack is typically divided into frames.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS 6.2.5 Stack Behavior in 64-Bit Mode In 64-bit mode, address calculations that reference SS segments are treated as if the segment base is zero. Fields (base, limit, and attribute) in segment descriptor registers are ignored. SS DPL is modified such that it is always equal to CPL. This will be true even if it is the only field in the SS descriptor that is modified.
When executing a near return, the processor performs these actions:
1. Pops the top-of-stack value (the return instruction pointer) into the EIP register.
2. If the RET instruction has an optional n argument, increments the stack pointer by the number of bytes specified with the n operand to release parameters from the stack.
3. Resumes execution of the calling procedure.
6.3.
[Figure: stack during near and far calls and returns. Each panel shows the stack frame before and after the call, Param 1 through Param 3, the calling CS (far calls only), the calling EIP, and the positions of ESP before and after the call and before and after the return.]
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS use the stack-frame base pointer (in the EBP register) to make a frame boundary for easy access to the parameters. The stack can also be used to pass parameters back from the called procedure to the calling procedure. 6.3.3.3 Passing Parameters in an Argument List An alternate method of passing a larger number of parameters (or a data structure) to the called procedure is to place the parameters in an argument list in one of the data segments in memory.
Figure 6-3. Protection Rings (level 0, the highest privilege level: operating system kernel; levels 1 and 2: operating system services such as device drivers; level 3, the lowest privilege level: applications)
In this example, the highest privilege level 0 (at the center of the diagram) is used for segments that contain the most critical code modules in the system, usually the kernel of an operating system.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS • The processor switches to a new stack to execute the called procedure. Each privilege level has its own stack. The segment selector and stack pointer for the privilege level 3 stack are stored in the SS and ESP registers, respectively, and are automatically saved when a call to a more privileged level occurs. The segment selectors and stack pointers for the privilege level 2, 1, and 0 stacks are stored in a system segment called the task state segment (TSS).
3. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
4. Pushes the temporarily saved SS and ESP values for the calling procedure’s stack onto the new stack.
5. Copies the parameters from the calling procedure’s stack to the new stack.
• In 64-bit mode and compatibility mode, 64-bit call-gate descriptors for far calls are available.
In 64-bit mode, the operand size for all near branches (CALL, RET, Jcc, JCXZ, JMP, and LOOP) is forced to 64 bits. These instructions update the 64-bit RIP without the need for a REX operand-size prefix.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS 6.4 INTERRUPTS AND EXCEPTIONS The processor provides two mechanisms for interrupting program execution, interrupts and exceptions: • An interrupt is an asynchronous event that is typically triggered by an I/O device. • An exception is a synchronous event that is generated when the processor detects one or more predefined conditions while executing an instruction. The IA-32 architecture specifies three classes of exceptions: faults, traps, and aborts.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS 6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures A call to an interrupt or exception handler procedure is similar to a procedure call to another protection level (see Section 6.3.6, “CALL and RET Operation Between Privilege Levels”). Here, the interrupt vector references one of two kinds of gates: an interrupt gate or a trap gate.
Table 6-1. Exceptions and Interrupts (Contd.)
Vector No.  Mnemonic  Description                        Source
13          #GP       General Protection                 Any memory reference and other protection checks.
14          #PF       Page Fault                         Any memory reference.
15                    Reserved
16          #MF       Floating-Point Error (Math Fault)  Floating-point or WAIT/FWAIT instruction.
17          #AC       Alignment Check                    Any data reference in memory (see footnote 3).
18          #MC       Machine Check                      Error codes (if any) and source are model dependent.
[Figure: stack usage on transfer to a handler. With no privilege-level change, EFLAGS, CS, EIP, and the error code are pushed on the stack shared by the interrupted procedure and the handler. With a privilege-level change, SS, ESP, EFLAGS, CS, EIP, and the error code are pushed on the handler’s stack. Both panels mark ESP before and after the transfer to the handler.]
Figure 6-5.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS A return from an interrupt or exception handler is initiated with the IRET instruction. The IRET instruction is similar to the far RET instruction, except that it also restores the contents of the EFLAGS register for the interrupted procedure. When executing a return from an interrupt or exception handler from the same privilege level as the interrupted procedure, the processor performs these actions: 1.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS interrupt table contains instruction pointers to the interrupt and exception handler procedures. The processor saves the state of the EFLAGS register, the EIP register, the CS register, and an optional error code on the stack before switching to the handler procedure. A return from the interrupt or exception handler is carried out with the IRET instruction.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS exception (#MF); when an SSE/SSE2/SSE3 instruction generates a floating-point exception, it in turn generates SIMD floating-point exception (#XF). See the following sections for further descriptions of the floating-point exceptions, how they are generated, and how they are handled: • Section 4.9.1, “Floating-Point Exception Conditions,” and Section 4.9.3, “Typical Actions of a Floating-Point Exception Handler” • Section 8.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS ENTER and LEAVE offer two benefits: • They provide machine-language support for implementing block-structured languages, such as C and Pascal. • They simplify procedure entry and exit in compiler-generated code. 6.5.1 ENTER Instruction The ENTER instruction creates a stack frame compatible with the scope rules typically used in block-structured languages.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS the EBP register on the stack, copies the contents of the ESP register into the EBP register, and subtracts the first operand from the contents of the ESP register to allocate dynamic storage. The non-nested form differs from the nested form in that no stack frame pointers are copied. The nested form of the ENTER instruction occurs when the second parameter (lexical level) is not zero.
Figure 6-6. Nested Procedures (Main at lexical level 1, Procedure A at level 2, Procedures B and C at level 3, Procedure D at level 4)
Block-structured languages can use the lexical levels defined by ENTER to control access to the variables of nested procedures.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS Therefore the base address for the dynamic storage used in MAIN is the current address in the EBP register, plus four bytes to account for the saved contents of MAIN’s EBP register. All dynamic variables for MAIN are at fixed, positive offsets from this value. Old EBP Display EBP Main’s EBP Dynamic Storage ESP Figure 6-7.
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS When procedure B calls procedure C, the ENTER instruction creates a new display for procedure C (see Figure 6-10). The first doubleword holds a copy of the last value in procedure B’s EBP register. This is used by the LEAVE instruction to restore procedure B’s stack frame. The second and third doublewords are copies of the two stack frame pointers in procedure A’s display.
[Figure: stack contents after procedure C is entered, showing the successive displays built for Main, procedure A, procedure B, and procedure C, followed by procedure C’s dynamic storage ending at ESP.]
Figure 6-10. Stack Frame After Entering Procedure C
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS 6.5.2 LEAVE Instruction The LEAVE instruction, which does not have any operands, reverses the action of the previous ENTER instruction. The LEAVE instruction copies the contents of the EBP register into the ESP register to release all stack space allocated to the procedure. Then it restores the old value of the EBP register from the stack. This simultaneously restores the ESP register to its original value.
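The frame set-up and tear-down described for ENTER and LEAVE can be traced with a small software model. The following Python sketch is illustrative only: the `Cpu` class, its dictionary-backed memory, and its method names are assumptions made for this example, and it covers just the 32-bit, non-nested form (lexical level 0).

```python
MASK32 = 0xFFFFFFFF

class Cpu:
    """Toy model of ESP, EBP, and a doubleword-wide stack (hypothetical helper)."""

    def __init__(self, esp):
        self.esp = esp
        self.ebp = 0
        self.mem = {}              # address -> doubleword

    def push(self, value):
        self.esp = (self.esp - 4) & MASK32
        self.mem[self.esp] = value & MASK32

    def pop(self):
        value = self.mem[self.esp]
        self.esp = (self.esp + 4) & MASK32
        return value

    def enter(self, size):
        """ENTER size, 0: save EBP, start a new frame, reserve dynamic storage."""
        self.push(self.ebp)
        self.ebp = self.esp
        self.esp = (self.esp - size) & MASK32

    def leave(self):
        """LEAVE: release the frame's stack space, then restore the old EBP."""
        self.esp = self.ebp
        self.ebp = self.pop()
```

In this model, an `enter(16)` followed by `leave()` returns both ESP and EBP to the values they held before the ENTER, which is the property the LEAVE description relies on.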
CHAPTER 7 PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS General-purpose (GP) instructions are a subset of the IA-32 instructions that represent the fundamental instruction set for the Intel IA-32 processors. These instructions were introduced into the IA-32 architecture with the first IA-32 processors (the Intel 8086 and 8088).
• Signed and unsigned byte, word, doubleword integers
• Near and far pointers
• Bit fields
• BCD integers
7.2 PROGRAMMING ENVIRONMENT FOR GP INSTRUCTIONS IN 64-BIT MODE
The programming environment for the general-purpose instructions in 64-bit mode is similar to that described in Section 7.1.
• General-purpose registers: In 64-bit mode, sixteen general-purpose registers are available. These include the eight GPRs described in Section 7.
7.3 SUMMARY OF GP INSTRUCTIONS
General-purpose instructions are divided into the following subgroups:
• Data transfer
• Binary arithmetic
• Decimal arithmetic
• Logical
• Shift and rotate
• Bit and byte
• Control transfer
• String
• I/O
• Enter and Leave
• Flag control
• Segment register
• Miscellaneous
Each subgroup of general-purpose instructions is discussed in the context of non-64-bit mode operation first.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS to/from Control Registers” and “MOV—Move to/from Debug Registers” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for information on moving data to and from the control and debug registers.) The MOV instruction cannot move data from one memory location to another or from one segment register to another segment register.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS These conditional move instructions are supported in the P6 family, Pentium 4, and Intel Xeon processors. Software can check if CMOVcc instructions are supported by checking the processor’s feature information with the CPUID instruction. 7.3.1.2 Exchange Instructions The exchange instructions swap the contents of one or more operands and, in some cases, perform additional operations such as asserting the LOCK signal or modifying flags in the EFLAGS register.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS Table 7-2. Conditional Move Instructions (Contd.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS The CMPXCHG8B instruction also requires three operands: a 64-bit value in EDX:EAX, a 64-bit value in ECX:EBX, and a destination operand in memory. The instruction compares the 64-bit value in the EDX:EAX registers with the destination operand. If they are equal, the 64-bit value in the ECX:EBX register is stored in the destination operand. If the EDX:EAX register and the destination are not equal, the destination is loaded in the EDX:EAX register.
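The compare-and-exchange behavior just described can be written out as a short model. This Python sketch is an illustration, not the instruction's definition: the function name and the tuple it returns are assumptions made here, register pairs are represented as single 64-bit integers, and the ZF result follows the instruction reference rather than the paragraph above. The model says nothing about atomicity, which on hardware depends on the LOCK prefix.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def cmpxchg8b(dest, edx_eax, ecx_ebx):
    """Return (new_dest, new_edx_eax, zf) for 64-bit integer inputs."""
    dest &= MASK64
    edx_eax &= MASK64
    if dest == edx_eax:
        # Equal: ECX:EBX is stored in the destination and ZF is set.
        return ecx_ebx & MASK64, edx_eax, 1
    # Not equal: the destination is loaded into EDX:EAX and ZF is cleared.
    return dest, dest, 0
```

A typical use is a retry loop: software reads the destination, computes a new value, and repeats the exchange until the returned ZF value is 1.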
Figure 7-2. Operation of the PUSHA Instruction (the stack grows toward lower addresses; the registers are pushed in the order EAX, ECX, EDX, EBX, old ESP, EBP, ESI, EDI)
The POP instruction copies the word or doubleword at the current top of stack (indicated by the ESP register) to the location specified with the destination operand.
Figure 7-4. Operation of the POPA Instruction (stack contents before popping, from higher to lower addresses: EAX, ECX, EDX, EBX, ignored, EBP, ESI, EDI)
7.3.1.5 Stack Manipulation Instructions in 64-Bit Mode
In 64-bit mode, the stack pointer size is 64 bits and cannot be overridden by an instruction prefix. In implicit stack references, address-size overrides are ignored.
Figure 7-5. Sign Extension (a 16-bit value with sign bit S is extended to 32 bits by copying S into bits 31 through 16)
Simple conversion: The CBW (convert byte to word), CWDE (convert word to doubleword extended), CWD (convert word to doubleword), and CDQ (convert doubleword to quadword) instructions perform sign extension to double the size of the source operand.
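Sign extension as performed by these conversion instructions can be sketched in Python. The helper names below are chosen to echo the mnemonics but are assumptions of this example; values are unsigned integers holding the register bit patterns, and `cdq` returns the EDX:EAX pair as a tuple.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit of a from_bits-wide value into the upper bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

def cbw(al):
    """AL -> AX."""
    return sign_extend(al, 8, 16)

def cwde(ax):
    """AX -> EAX."""
    return sign_extend(ax, 16, 32)

def cdq(eax):
    """EAX -> (EDX, EAX)."""
    wide = sign_extend(eax, 32, 64)
    return wide >> 32, wide & 0xFFFFFFFF
```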
For the purpose of this discussion, these instructions are divided into subordinate subgroups of instructions that:
• Add and subtract
• Increment and decrement
• Compare and change signs
• Multiply and divide
7.3.2.
The NEG (negate) instruction subtracts a signed integer operand from zero. The effect of the NEG instruction is to change the sign of a two's complement operand while keeping its magnitude.
7.3.2.5 Multiplication and Divide Instructions
The processor provides two multiply instructions, MUL (unsigned multiply) and IMUL (signed multiply), and two divide instructions, DIV (unsigned divide) and IDIV (signed divide).
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS (see Section 4.7, “BCD and Packed BCD Integers”). Adding two packed BCD values requires two instructions: an ADD instruction followed by a DAA instruction. The ADD instruction adds (binary addition) the two values and stores the result in the AL register. The DAA instruction then adjusts the value in the AL register to obtain a valid, 2-digit, packed BCD value and sets the CF flag if a decimal carry occurred as the result of the addition.
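The ADD followed by DAA sequence can be checked with a small model. The Python sketch below is a simplified illustration under stated assumptions: the function names are invented for this example, only AL, CF, and AF are modeled, and the adjustment rule follows the instruction reference for DAA.

```python
def add_al(al, operand):
    """8-bit binary ADD into AL; returns (al, cf, af)."""
    total = al + operand
    af = int((al & 0x0F) + (operand & 0x0F) > 0x0F)
    return total & 0xFF, int(total > 0xFF), af

def daa(al, cf, af):
    """Decimal adjust AL after addition; returns (al, cf)."""
    old_al, old_cf = al, cf
    cf = 0
    if (al & 0x0F) > 9 or af:
        al = (al + 6) & 0xFF        # fix the low BCD digit
    if old_al > 0x99 or old_cf:
        al = (al + 0x60) & 0xFF     # fix the high BCD digit
        cf = 1                      # decimal carry out of the two digits
    return al, cf

def bcd_add(a, b):
    """Add two packed BCD bytes; returns (packed BCD result, decimal carry)."""
    return daa(*add_al(a, b))
```

For example, adding the packed BCD values 79H and 35H yields 14H with CF set, which represents the decimal sum 114.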
7.3.4 Decimal Arithmetic Instructions in 64-Bit Mode
Decimal arithmetic instructions are not supported in 64-bit mode. They are either invalid or not encodable.
7.3.5 Logical Instructions
The logical instructions AND, OR, XOR (exclusive or), and NOT perform the standard Boolean operations for which they are named. The AND, OR, and XOR instructions require two operands; the NOT instruction operates on a single operand. 7.3.
Figure 7-6. SHL/SAL Instruction Operation (bit diagram of the operand and the CF flag in the initial state, after a 1-bit SHL/SAL instruction, and after a 10-bit SHL/SAL instruction)
The SHR instruction shifts the source operand right by from 1 to 31 bit positions (see Figure 7-7).
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS The SAR and SHR instructions can also be used to perform division by powers of 2 (see “SAL/SAR/SHL/SHR—Shift Instructions” in Chapter 4, “Instruction Set Reference, N-Z,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B).
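One detail worth keeping in mind when using SAR for signed division is rounding: an arithmetic right shift rounds toward negative infinity, whereas IDIV truncates toward zero, so the two disagree for negative operands that are not exact multiples of the divisor. The Python sketch below illustrates the difference; the function names are assumptions of this example, and operands are ordinary signed Python integers rather than register bit patterns.

```python
def sar(value, count):
    """Arithmetic right shift of a signed value (count masked to 5 bits)."""
    return value >> (count & 31)    # Python's >> on ints is an arithmetic shift

def idiv_pow2(value, power):
    """Signed division by 2**power, truncating toward zero as IDIV does."""
    quotient = abs(value) >> power
    return -quotient if value < 0 else quotient
```

With these helpers, `sar(-9, 2)` gives -3 while `idiv_pow2(-9, 2)` gives -2; for non-negative operands the two agree.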
Figure 7-9. SHLD and SHRD Instruction Operations (SHLD shifts bits from the source register into the low end of the destination, and the last bit shifted out of the destination goes to CF; SHRD shifts bits from the source register into the high end of the destination, and the last bit shifted out goes to CF)
The SHLD instruction shifts the bits in the destination operand to the left and fills the empty bit positions (in the destination operand) with bits shifted out of the source operand.
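The SHLD operation can be modeled as follows. This Python sketch is an illustration with assumed names (`shld32`, `shl64`); it handles 32-bit operands only and reports CF as `None` when the masked count is zero, to indicate that nothing changes. The second helper shows one common use, shifting a 64-bit value held in two 32-bit halves.

```python
MASK32 = 0xFFFFFFFF

def shld32(dest, src, count):
    """Return (result, cf) for a 32-bit SHLD; cf is None if the count is 0."""
    count &= 31
    dest &= MASK32
    src &= MASK32
    if count == 0:
        return dest, None
    result = ((dest << count) | (src >> (32 - count))) & MASK32
    cf = (dest >> (32 - count)) & 1     # last bit shifted out of the destination
    return result, cf

def shl64(high, low, count):
    """Shift the 64-bit value high:low left by count (1 to 31) bits."""
    new_high, _ = shld32(high, low, count)
    new_low = (low << count) & MASK32
    return new_high, new_low
```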
Figure 7-10. ROL, ROR, RCL, and RCR Instruction Operations (ROL and ROR rotate the bits of the 32-bit destination and copy the bit rotated around into CF; RCL and RCR rotate through CF, treating it as an extension of the destination)
The ROL instruction rotates the bits in the operand to the left (toward more significant bit locations).
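The difference between a plain rotate and a rotate through carry is easy to see in a model. The Python sketch below uses assumed function names and 32-bit operands; `rol32` returns CF as `None` when the masked count is zero, and `rcl32` treats CF and the operand together as a 33-bit quantity.

```python
MASK32 = 0xFFFFFFFF
MASK33 = (1 << 33) - 1

def rol32(value, count):
    """Rotate left; returns (result, cf)."""
    value &= MASK32
    count &= 31
    if count == 0:
        return value, None
    result = ((value << count) | (value >> (32 - count))) & MASK32
    return result, result & 1           # CF receives the bit rotated into bit 0

def rcl32(value, cf, count):
    """Rotate left through carry; returns (result, cf)."""
    wide = ((cf & 1) << 32) | (value & MASK32)   # CF sits above bit 31
    for _ in range(count & 31):
        wide = ((wide << 1) | (wide >> 32)) & MASK33
    return wide & MASK32, wide >> 32
```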
7.3.7 Bit and Byte Instructions
These instructions operate on bit or byte strings. For the purpose of this discussion, they are further divided into subordinate subgroups that:
• Test and modify a single bit
• Scan a bit string
• Set a byte given conditions
• Test operands and report results
7.3.7.1 Bit Test and Modify Instructions
The bit test and modify instructions (see Table 7-3) operate on a single bit, which can be in an operand.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS “EFLAGS Condition Codes,” lists the conditions it is possible to test for with this instruction. 7.3.7.4 Test Instruction The TEST instruction performs a logical AND of two operands and sets the SF, ZF, and PF flags according to the results. The flags can then be tested by the conditional jump or loop instructions or the SETcc instructions. The TEST instruction differs from the AND instruction in that it does not alter either of the operands. 7.3.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS • An address specified using the standard addressing modes of the processor — Here, the address can be a near pointer or a far pointer. If the address is for a near pointer, the address is translated into an offset and copied into the EIP register. If the address is for a far pointer, the address is translated into a segment selector (which is copied into the CS register) and an offset (which is copied into the EIP register).
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS 7.3.8.2 Conditional Transfer Instructions The conditional transfer instructions execute jumps or loops that transfer program control to another instruction in the instruction stream if specified conditions are met. The conditions for control transfer are specified with a set of condition codes that define various states of the status flags (CF, ZF, OF, PF, and SF) in the EFLAGS register.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS The destination operand specifies a relative address (a signed offset with respect to the address in the EIP register) that points to an instruction in the current code segment.
The LOOPNE and LOOPNZ instructions (mnemonics for the same instruction) operate the same as the LOOPE/LOOPZ instructions, except that they terminate the loop if the ZF flag is set. Jump if zero instructions: The JECXZ (jump if ECX zero) instruction jumps to the location specified in the destination operand if the ECX register contains the value zero.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS The INT n instruction can raise any of the processor’s interrupts or exceptions by encoding the vector number or the interrupt or exception in the instruction. This instruction can be used to support software generated interrupts or to test the operation of interrupt and exception handlers. The IRET (return from interrupt) instruction returns program control from an interrupt handler to the interrupted procedure.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS ating the ESI register with the ES segment register, both the source and destination strings can be located in the same segment. (This latter condition can also be achieved by loading the DS and ES segment registers with the same segment selector and allowing the ESI register to default to the DS register.) The MOVS instruction moves the string element addressed by the ESI register to the location addressed by the EDI register.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS at higher addresses and work toward lower ones, or they can begin at lower addresses and work toward higher ones. The DF flag in the EFLAGS register controls whether the registers are incremented (DF = 0) or decremented (DF = 1). The STD and CLD instructions set and clear this flag, respectively.
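The effect of the DF flag matters most when the source and destination strings overlap. The Python sketch below models a repeated byte move on a `bytearray`; the function name and its return value are assumptions of this example, and segment registers are ignored.

```python
def rep_movsb(mem, esi, edi, ecx, df):
    """Model REP MOVSB on a bytearray; returns the final (esi, edi, ecx)."""
    step = -1 if df else 1          # DF = 1 decrements, DF = 0 increments
    while ecx != 0:
        mem[edi] = mem[esi]
        esi += step
        edi += step
        ecx -= 1
    return esi, edi, ecx
```

Copying the six bytes `abcdef` two positions higher within the same buffer works when the copy starts at the high addresses with DF = 1, whereas a forward copy with DF = 0 overwrites source bytes before they are read.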
The block I/O instructions (INS and OUTS) move blocks of data (strings) between an I/O port and memory. These instructions operate similarly to the string instructions (see Section 7.3.9, “String Operations”). The ESI and EDI registers are used to specify string elements in memory and the repeat prefixes (REP) are used to repeat the instructions to implement block moves.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS uses the flag in an operation is executed. They are also used in conjunction with the rotate-with-carry instructions (RCL and RCR). The STD (set direction flag) and CLD (clear direction flag) instructions allow the DF flag in the EFLAGS register to be modified directly. The DF flag determines the direction in which index registers ESI and EDI are stepped when executing string processing instructions.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS The POPFD instruction pops a doubleword into the EFLAGS register. This instruction can change the state of the AC bit (bit 18) and the ID bit (bit 21), as well as the bits affected by a POPF instruction. The restrictions for changing the IOPL bits and the IF flag that were given for the POPF instruction also apply to the POPFD instruction. 7.3.14.
PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS segment registers (DS, ES, FS, GS, and SS). The transfers are always made to or from a segment register and a general-purpose register or memory. Transfers between segment registers are not supported. The POP and MOV instructions cannot place a value in the CS register. Only the far control-transfer versions of the JMP, CALL, and RET instructions (see Section 7.3.16.2, “Far Control Transfer Instructions”) affect the CS register directly. 7.3.16.
• Processor identification
• NOP and undefined instruction entry
7.3.17.1 Address Computation Instruction
The LEA (load effective address) instruction computes the effective address in memory (offset within a segment) of a source operand and places it in a general-purpose register. This instruction can interpret any of the processor’s addressing modes and can perform any indexing or scaling that may be needed.
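Because LEA only computes an address and never accesses memory, it behaves like a small arithmetic expression. The Python sketch below models the 32-bit base plus scaled index plus displacement form; the function name and keyword parameters are assumptions of this example.

```python
MASK32 = 0xFFFFFFFF

def lea32(base=0, index=0, scale=1, disp=0):
    """Effective address = base + index * scale + displacement, modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK32
```

Compilers often exploit this for arithmetic: with the same register as base and index and a scale of 4, the result is five times the register value.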
CHAPTER 8 PROGRAMMING WITH THE X87 FPU The x87 Floating-Point Unit (FPU) provides high-performance floating-point processing capabilities for use in graphics processing, scientific, engineering, and business applications. It supports the floating-point, integer, and packed BCD integer data types and the floating-point processing algorithms and exception handling architecture defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic.
PROGRAMMING WITH THE X87 FPU The x87 FPU executes instructions from the processor’s normal instruction stream. The state of the x87 FPU is independent from the state of the basic execution environment and from the state of SSE/SSE2/SSE3 extensions. However, the x87 FPU and Intel MMX technology share state because the MMX registers are aliased to the x87 FPU data registers.
[Figure: eight 80-bit data registers, R7 through R0, each with a sign bit in bit 79, an exponent in bits 78 through 64, and a significand in bits 63 through 0; 16-bit control, status, and tag registers; 48-bit last instruction pointer and last data (operand) pointer registers; an 11-bit opcode register.]
Figure 8-1. x87 FPU Execution Environment
The x87 FPU instructions treat the eight x87 FPU data registers as a register stack (see Figure 8-2). All addressing of the data registers is relative to the register on the top of the stack.
[Figure: the FPU data register stack with TOP = 011B. ST(0) is register 3, ST(1) is register 4, and ST(2) is register 5; the stack grows toward register 0.]
Figure 8-2. x87 FPU Data Register Stack
If a load operation is performed when TOP is at 0, register wraparound occurs and the new value of TOP is set to 7. The floating-point stack-overflow exception indicates when wraparound might cause an unsaved value to be overwritten (see Section 8.5.1.1, “Stack Overflow or Underflow Exception (#IS)”).
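The relationship between TOP, the physical registers, and the ST(i) names can be sketched with a small class. This Python model is illustrative only: the class and method names are assumptions, `None` stands in for an empty register, and the stack overflow and underflow exceptions are not modeled.

```python
class X87Stack:
    """Toy model of the eight physical data registers and the 3-bit TOP field."""

    def __init__(self):
        self.regs = [None] * 8      # None marks an empty register
        self.top = 0

    def physical(self, i):
        """Physical register number that ST(i) currently names."""
        return (self.top + i) & 7

    def push(self, value):
        """Load: decrement TOP (0 wraps to 7), then write the new ST(0)."""
        self.top = (self.top - 1) & 7
        self.regs[self.top] = value

    def pop(self):
        """Pop: read ST(0), mark it empty, then increment TOP."""
        value = self.regs[self.top]
        self.regs[self.top] = None
        self.top = (self.top + 1) & 7
        return value
```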
Computation
Dot Product = (5.6 x 2.4) + (3.8 x 10.3)
Code:
FLD value1     ;(a) value1 = 5.6
FMUL value2    ;(b) value2 = 2.4
FLD value3     ; value3 = 3.8
FMUL value4    ;(c) value4 = 10.3
FADD ST(1)     ;(d)
Register stack after each step (registers not listed are unused):
(a) R4 = 5.6, ST(0)
(b) R4 = 13.44, ST(0)
(c) R4 = 13.44, ST(1); R3 = 39.14, ST(0)
(d) R4 = 13.44, ST(1); R3 = 52.58, ST(0)
Figure 8-3.
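The dot product sequence above can be replayed step by step in software. In this Python sketch a list stands in for the register stack, with its last element playing the role of ST(0); the function name is an assumption, and Python floats are double precision rather than the double extended-precision format the x87 FPU uses internally, so the low-order bits of the result may differ.

```python
def dot_product(value1, value2, value3, value4):
    """Replay FLD/FMUL/FLD/FMUL/FADD ST(1) with a list as the register stack."""
    st = []                 # st[-1] is ST(0), st[-2] is ST(1)
    st.append(value1)       # FLD  value1
    st[-1] *= value2        # FMUL value2
    st.append(value3)       # FLD  value3
    st[-1] *= value4        # FMUL value4
    st[-1] += st[-2]        # FADD ST(1): ST(0) = ST(0) + ST(1)
    return st[-1]
```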
8.1.3 x87 FPU Status Register
The 16-bit x87 FPU status register (see Figure 8-4) indicates the current state of the x87 FPU. The flags in the x87 FPU status register include the FPU busy flag, top-of-stack (TOP) pointer, condition code flags, error summary status flag, stack fault flag, and exception flags. The x87 FPU sets the flags in this register to show the results of operations.
PROGRAMMING WITH THE X87 FPU are used principally for conditional branching and for storage of information used in exception handling (see Section 8.1.4, “Branching and Conditional Moves on Condition Codes”). As shown in Table 8-1, the C1 condition code flag is used for a variety of functions. When both the IE and SF flags in the x87 FPU status word are set, indicating a stack overflow or underflow exception (#IS), the C1 flag distinguishes between overflow (C1 = 1) and underflow (C1 = 0).
PROGRAMMING WITH THE X87 FPU Table 8-1. Condition Code Interpretation Instruction C0 C3 C2 FCOM, FCOMP, FCOMPP, FICOM, FICOMP, FTST, FUCOM, FUCOMP, FUCOMPP Result of Comparison FCOMI, FCOMIP, FUCOMI, FUCOMIP Undefined. (These instructions set the status flags in the EFLAGS register.
8.1.3.4 Stack Fault Flag
The stack fault flag (bit 6 of the x87 FPU status word) indicates that stack overflow or stack underflow has occurred with data in the x87 FPU data register stack. The x87 FPU explicitly sets the SF flag when it detects a stack overflow or underflow condition, but it does not explicitly clear the flag when it detects an invalid-arithmetic-operand condition.
[Figure: the FSTSW AX instruction copies the x87 FPU status word into the AX register, and the SAHF instruction then copies AH into the low byte of the EFLAGS register. Condition code to status flag mapping: C0 to CF, C1 to (none), C2 to PF, C3 to ZF.]
Figure 8-5. Moving the Condition Codes to the EFLAGS Register
The new mechanism is available beginning with the P6 family processors.
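The FSTSW AX and SAHF path shown in Figure 8-5 amounts to a fixed bit mapping, which can be sketched as follows. The Python names are assumptions of this example, and the status-word bit positions (C0 in bit 8, C1 in bit 9, C2 in bit 10, C3 in bit 14) are taken from the status word layout; SAHF also loads SF and AF, which this sketch leaves out.

```python
C0, C1, C2, C3 = 1 << 8, 1 << 9, 1 << 10, 1 << 14   # status-word bit masks

def fstsw_sahf(status_word):
    """Return the CF, PF, and ZF values produced by FSTSW AX followed by SAHF."""
    ah = (status_word >> 8) & 0xFF      # FSTSW AX leaves bits 15..8 in AH
    return {
        "CF": (ah >> 0) & 1,            # from C0
        "PF": (ah >> 2) & 1,            # from C2
        "ZF": (ah >> 6) & 1,            # from C3
    }
```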
[Figure: x87 FPU control word layout. Bits 0 through 5 hold the exception masks IM (invalid operation), DM (denormal operand), ZM (zero divide), OM (overflow), UM (underflow), and PM (precision); PC is the precision control field; RC is the rounding control field; X is the infinity control bit; the remaining bits are reserved.]
Figure 8-6.
Table 8-2. Precision Control Field (PC)
Precision                            PC Field
Single Precision (24 bits)           00B
Reserved                             01B
Double Precision (53 bits)           10B
Double Extended Precision (64 bits)  11B
The double precision and single precision settings reduce the size of the significand to 53 bits and 24 bits, respectively. These settings are provided to support IEEE Standard 754 and to provide compatibility with the specifications of certain existing programming languages.
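Decoding the precision control field and the exception masks from a control word value is a matter of shifting and masking. The Python sketch below uses assumed names and takes the field positions from the control word layout (PC in bits 9 and 8, exception masks in bits 5 through 0); the rounding control field is not decoded.

```python
PRECISION = {
    0b00: "single (24 bits)",
    0b01: "reserved",
    0b10: "double (53 bits)",
    0b11: "double extended (64 bits)",
}
MASK_NAMES = ["IM", "DM", "ZM", "OM", "UM", "PM"]   # bits 0 through 5

def decode_control_word(cw):
    """Return (precision setting, list of exception masks that are set)."""
    pc = (cw >> 8) & 0b11
    masked = [name for bit, name in enumerate(MASK_NAMES) if (cw >> bit) & 1]
    return PRECISION[pc], masked
```

For the value 037FH, which is the control word loaded by FINIT/FNINIT, this reports double extended precision with all six exceptions masked.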
[Figure: the 16-bit tag word holds eight 2-bit tags, from TAG(7) in bits 15 and 14 down to TAG(0) in bits 1 and 0.]
TAG Values
00: Valid
01: Zero
10: Special: invalid (NaN, unsupported), infinity, or denormal
11: Empty
Figure 8-7. x87 FPU Tag Word
Each tag in the x87 FPU tag word corresponds to a physical register (numbers 0 through 7). The current top-of-stack (TOP) pointer stored in the x87 FPU status word can be used to associate tags with registers relative to ST(0).
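Associating a tag with a stack-relative register takes two steps: extract TOP from the status word, then index the tag word by the resulting physical register number. The Python sketch below illustrates this; the function name is an assumption, and the TOP field position (status-word bits 13 through 11) is taken from the status word layout.

```python
TAG_NAMES = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}

def tag_of_st(tag_word, status_word, i):
    """Return the tag name for ST(i), given the tag word and status word."""
    top = (status_word >> 11) & 0b111       # TOP field
    physical = (top + i) & 7                # ST(i) -> physical register number
    return TAG_NAMES[(tag_word >> (2 * physical)) & 0b11]
```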
Note that the value in the x87 FPU data pointer register is always a pointer to a memory operand. If the last non-control instruction that was executed did not have a memory operand, the value in the data pointer register is undefined (reserved).
[Figure: the 11-bit x87 FPU opcode register holds bits 2 through 0 of the first instruction byte in bits 10 through 8 and the entire second instruction byte in bits 7 through 0.]
Figure 8-8. Contents of x87 FPU Opcode Registers
The fopcode compatibility mode should be enabled only when x87 FPU floating-point exception handlers are designed to use the fopcode to analyze program performance or restart a program after an exception has been handled. 8.1.
32-Bit Protected Mode Format
Byte offset  Contents
0            Control Word (bits 15-0)
4            Status Word (bits 15-0)
8            Tag Word (bits 15-0)
12           FPU Instruction Pointer Offset (bits 31-0)
16           00000 and Opcode 10...00 (bits 31-16); FPU Instruction Pointer Selector (bits 15-0)
20           FPU Operand Pointer Offset (bits 31-0)
24           FPU Operand Pointer Selector (bits 15-0)
For instructions that also store x87 FPU data registers, the eight 80-bit registers (R0-R7) follow the above structure in sequence.
Figure 8-9.
PROGRAMMING WITH THE X87 FPU 16-Bit Protected Mode Format 0 15 Control Word 0 Status Word 2 Tag Word 4 FPU Instruction Pointer Offset 6 FPU Instruction Pointer Selector 8 FPU Operand Pointer Offset 10 FPU Operand Pointer Selector 12 Figure 8-11. Protected Mode x87 FPU State Image in Memory, 16-Bit Format 16-Bit Real-Address Mode and Virtual-8086 Mode Format 0 15 Control Word 0 Status Word 2 Tag Word FPU Instruction Pointer 15...00 IP 19..16 0 Opcode 10...00 FPU Operand Pointer 15...
8.2 X87 FPU DATA TYPES
The x87 FPU recognizes and operates on the following seven data types (see Figure 8-13): single-precision floating point, double-precision floating point, double extended-precision floating point, signed word integer, signed doubleword integer, signed quadword integer, and packed BCD decimal integers. For detailed information about these data types, see Section 4.2.2, “Floating-Point Data Types,” Section 4.2.1.2, “Signed Integers,” and Section 4.
PROGRAMMING WITH THE X87 FPU Single-Precision Floating-Point Sign Exp.
PROGRAMMING WITH THE X87 FPU 4-4 for the encoding of the integer indefinite, QNaN floating-point indefinite, and packed BCD integer indefinite, respectively. The binary integer encoding 100..
PROGRAMMING WITH THE X87 FPU Table 8-3. Unsupported Double Extended-Precision Floating-Point Encodings and Pseudo-Denormals Class Positive Pseudo-NaNs Positive Floating Point Negative Floating Point Negative Pseudo-NaNs Significand Sign Biased Exponent Integer Fraction Quiet 0 . 0 11..11 . 11..11 0 11..11 . 10..00 Signaling 0 . 0 11..11 . 11..11 0 01..11 . 00..01 Pseudo-infinity 0 11..11 0 00..00 Unnormals 0 . 0 11..10 . 00..01 0 11..11 . 00..00 Pseudo-denormals 0 . 0 00..
8.3 X87 FPU INSTRUCTION SET
The floating-point instructions that the x87 FPU supports can be grouped into six functional categories:
• Data transfer instructions
• Basic arithmetic instructions
• Comparison instructions
• Transcendental instructions
• Load constant instructions
• x87 FPU control instructions
See Section 5.2, “x87 FPU Instructions,” for a list of the floating-point instructions by category.
• Store the value in the ST(0) register to memory in floating-point, integer, or packed BCD format.
• Move values between registers in the x87 FPU register stack.
The FLD (load floating point) instruction pushes a floating-point operand from memory onto the top of the x87 FPU data-register stack. If the operand is in single-precision or double-precision floating-point format, it is automatically converted to double extended-precision floating-point format.
PROGRAMMING WITH THE X87 FPU status flags in the EFLAGS register. The condition code mnemonics are appended to the letters “FCMOV” to form the mnemonic for a FCMOVcc instruction. Table 8-5.
PROGRAMMING WITH THE X87 FPU set in the x87 FPU status word if the value is rounded up. See Section 8.3.8, “Pi,” for information on the π constant. 8.3.5 Basic Arithmetic Instructions The following floating-point instructions perform basic arithmetic operations on floating-point numbers.
PROGRAMMING WITH THE X87 FPU Reverse versions of the subtract (FSUBR) and divide (FDIVR) instructions enable efficient coding.
FTST Test (compare floating point with 0.0).
FXAM Examine.
Comparison of floating-point values differs from comparison of integers because floating-point values have four (rather than three) mutually exclusive relationships: less than, equal, greater than, and unordered. The unordered relationship is true when at least one of the two values being compared is a NaN or in an unsupported format.
FCOMP instructions, except that they set the status flags (ZF, PF, and CF) in the EFLAGS register to indicate the results of the comparison (see Table 8-7) instead of the x87 FPU condition code flags. The FCOMI and FCOMIP instructions allow conditional branch instructions (Jcc) to be executed directly from the results of their comparison. Table 8-7.
Table 8-8. TEST Instruction Constants for Conditional Branching
Order                     Constant   Branch
ST(0) > Source Operand    4500H      JZ
ST(0) < Source Operand    0100H      JNZ
ST(0) = Source Operand    4000H      JNZ
Unordered                 0400H      JNZ
2. Check ordered comparison result.
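The constants in Table 8-8 can be exercised with a small Python model. It assumes that the status word has already been copied to AX (for example with FSTSW AX) and that a compare instruction encodes its result in C3, C2, and C0 as 000 (greater), 001 (less), 100 (equal), and 111 (unordered). That encoding and the helper names `branch_taken` and `classify` are assumptions of this sketch, not text from this section.

```python
C0, C2, C3 = 0x0100, 0x0400, 0x4000  # condition-code bits of the status word
ALL_CODES = C3 | C2 | C0             # 4500H, the first constant in Table 8-8


def branch_taken(status_word: int, constant: int, branch: str) -> bool:
    """Model TEST AX, constant followed by JZ or JNZ."""
    zero_flag = (status_word & constant) == 0  # TEST sets ZF if the AND is 0
    if branch == "JZ":
        return zero_flag
    if branch == "JNZ":
        return not zero_flag
    raise ValueError("branch must be 'JZ' or 'JNZ'")


def classify(status_word: int) -> str:
    """Apply the Table 8-8 tests, checking for unordered first."""
    if branch_taken(status_word, C2, "JNZ"):
        return "unordered"   # unordered also sets C0 and C3, so test it first
    if branch_taken(status_word, ALL_CODES, "JZ"):
        return "greater"
    if branch_taken(status_word, C0, "JNZ"):
        return "less"
    return "equal"           # only C3 can remain set at this point
```

Testing for the unordered case first matters because an unordered result sets all three condition-code bits, so the tests for less than and equal would otherwise also succeed.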
8.3.8 Pi
When the argument (source operand) of a trigonometric function is within the range of the function, the argument is automatically reduced by the appropriate multiple of 2π through the same reduction mechanism used by the FPREM and FPREM1 instructions. The internal value of π that the x87 FPU uses for argument reduction and other computations is as follows:
π = 0.f ∗ 2^2
where:
f = C90FDAA2 2168C234 C
(The spaces in the fraction above indicate 32-bit boundaries.
PROGRAMMING WITH THE X87 FPU Similar versions of π can also be written in double extended-precision floating-point format. When using this two-part π value in an algorithm, parallel computations should be performed on each part, with the results kept separate. When all the computations are complete, the two results can be added together to form the final result.
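The value of π given in Section 8.3.8 and the two-part technique can be checked with exact rational arithmetic. The Python sketch below builds 0.f ∗ 2^2 from the hexadecimal fraction and splits it into a high part and a low part whose sum is exact. The split position (31 fraction bits) and the helper `split_two_parts` are illustrative choices, not values taken from this manual.

```python
from fractions import Fraction
import math

F_HEX = "C90FDAA22168C234C"  # fraction digits f with the spaces removed
PI_X87 = Fraction(int(F_HEX, 16), 16 ** len(F_HEX)) * 2 ** 2  # 0.f * 2^2


def split_two_parts(value: Fraction, high_fraction_bits: int = 31):
    """Split value into a short high part and the exact remainder."""
    scale = 2 ** high_fraction_bits
    high = Fraction(math.floor(value * scale), scale)
    low = value - high
    return high, low


HIGH_PI, LOW_PI = split_two_parts(PI_X87)
```

Because each part has few enough significant bits to be held exactly in a double-precision value, computations can be carried out on the two parts separately and the results added at the end, as the text describes.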
the correct and computed (approximate) function values, respectively. The error in ulps is defined to be:
error = (f(x) – F(x)) / 2^(k – 63)
where k is an integer such that:
1 ≤ 2^(–k) f(x) < 2.
With the Pentium processor and later IA-32 processors, the worst case error on transcendental functions is less than 1 ulp when rounding to the nearest (even) and less than 1.5 ulps when rounding in other modes.
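The ulp definition can be evaluated exactly with rational arithmetic. In the Python sketch below, `exponent_k` finds the integer k that normalizes the correct value into the interval [1, 2), and `error_in_ulps` divides the difference by 2^(k – 63). Both function names are hypothetical, and taking the magnitude of the difference (and of f(x) when finding k) is an assumption of this sketch.

```python
from fractions import Fraction


def exponent_k(correct: Fraction) -> int:
    """Return the integer k such that 1 <= 2**(-k) * |correct| < 2."""
    magnitude = abs(Fraction(correct))
    if magnitude == 0:
        raise ValueError("k is undefined for a correct value of zero")
    k = 0
    while magnitude >= 2:
        magnitude /= 2
        k += 1
    while magnitude < 1:
        magnitude *= 2
        k -= 1
    return k


def error_in_ulps(correct, computed) -> Fraction:
    """Error of computed relative to correct, in units of 2**(k - 63)."""
    f = Fraction(correct)
    F = Fraction(computed)
    return abs(f - F) / Fraction(2) ** (exponent_k(f) - 63)
```

For example, a computed value that differs from a correct value of 1 by 2^(–63) has an error of exactly 1 ulp under this definition.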
PROGRAMMING WITH THE X87 FPU control and status words, respectively, in memory (or for an FSTSW/FNSTSW instruction in a general-purpose register). The FSTENV/FNSTENV and FSAVE/FNSAVE instructions save the x87 FPU environment and state, respectively, in memory. The x87 FPU environment includes all the x87 FPU’s control and status registers; the x87 FPU state includes the x87 FPU environment and the data registers in the x87 FPU register stack.
Section D.2.1.3, “No-Wait x87 FPU Instructions Can Get x87 FPU Interrupt in Window.” When operating a P6 family, Pentium 4, or Intel Xeon processor in MS-DOS compatibility mode, non-waiting instructions cannot be interrupted in this way (see Section D.2.2, “MS-DOS Compatibility Sub-mode in the P6 Family and Pentium 4 Processors”). 8.3.
PROGRAMMING WITH THE X87 FPU various classes of floating-point exceptions. This information pertains to x87 FPU as well as SSE/SSE2/SSE3 extensions. The following sections give specific information about how the x87 FPU handles floating-point exceptions that are unique to the x87 FPU. 8.4.1 Arithmetic vs. Non-arithmetic Instructions When dealing with floating-point exceptions, it is useful to distinguish between arithmetic instructions and non-arithmetic instructions.
Table 8-9. Arithmetic and Non-arithmetic Instructions (Contd.)
Non-arithmetic Instructions    Arithmetic Instructions
FSTSW/FNSTSW                   FRNDINT
WAIT/FWAIT                     FSCALE
FXAM                           FSIN
FXCH                           FSINCOS
                               FSQRT
                               FST/FSTP (single and double)
                               FSUB/FSUBP/FSUBR/FSUBRP
                               FTST
                               FUCOM/FUCOMP/FUCOMPP
                               FXTRACT
                               FYL2X/FYL2XP1
NOTE: 1. The FISTTP instruction in SSE3 is an arithmetic x87 FPU instruction. 8.
Note that the x87 FPU explicitly sets the SF flag when it detects a stack overflow or underflow condition, but it does not explicitly clear the flag when it detects an invalid-arithmetic-operand condition. As a result, the state of the SF flag can be 1 following an invalid-arithmetic-operation exception, if it was not cleared from the last time a stack overflow or underflow condition occurred. See Section 8.1.3.4, “Stack Fault Flag,” for more information about the SF flag. 8.5.
PROGRAMMING WITH THE X87 FPU 8.5.1.2 Invalid Arithmetic Operand Exception (#IA) The x87 FPU is able to detect a variety of invalid arithmetic operations that can be coded in a program. These operations are listed in Table 8-10. (This list includes the invalid operations defined in IEEE Standard 754.) When the x87 FPU detects an invalid arithmetic operand, it sets the IE flag (bit 0) in the x87 FPU status word to 1.
Table 8-10. Invalid Arithmetic Operations and the Masked Responses to Them (Contd.)
Condition: FIST/FISTP: Converted value exceeds representable integer range of the destination operand, or source value is an SNaN, QNaN, ±∞, or in an unsupported format.
Masked Response: Store integer indefinite value in the destination operand.
Condition: FXCH: one or both registers are tagged empty.
Masked Response: Load empty registers with the QNaN floating-point indefinite value, then perform the exchange.
PROGRAMMING WITH THE X87 FPU 8.5.3 Divide-By-Zero Exception (#Z) The x87 FPU reports a floating-point divide-by-zero exception whenever an instruction attempts to divide a finite non-zero operand by 0. The flag (ZE) for this exception is bit 2 of the x87 FPU status word, and the mask bit (ZM) is bit 2 of the x87 FPU control word.
The action that the x87 FPU takes when numeric overflow occurs and the numeric-overflow exception is not masked depends on whether the instruction is supposed to store the result in memory or on the register stack.
• Destination is a memory location — The OE flag is set and a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”). The top-of-stack pointer (TOP) and source and destination operands remain unchanged.
PROGRAMMING WITH THE X87 FPU The flag (UE) for the numeric-underflow exception is bit 4 of the x87 FPU status word, and the mask bit (UM) is bit 4 of the x87 FPU control word. When a numeric-underflow condition occurs and the exception is masked, the x87 FPU performs the operation described in Section 4.9.1.5, “Numeric Underflow Exception (#U).
The inexact-result exception flag (PE) is bit 5 of the x87 FPU status word, and the mask bit (PM) is bit 5 of the x87 FPU control word. If the inexact-result exception is masked when an inexact-result condition occurs and a numeric overflow or underflow condition has not occurred, the x87 FPU handles the exception as described in Section 4.9.1.6, “Inexact-Result (Precision) Exception (#P),” with one additional action.
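The exception flag and mask bit positions given in these sections (for example, ZE and ZM at bit 2, UE and UM at bit 4, PE and PM at bit 5) line up between the status word and the control word. The Python sketch below uses that correspondence to list flagged and unmasked exceptions. The positions of DE (bit 1) and OE (bit 3) are filled in as assumptions from the standard x87 layout, and the function names are hypothetical.

```python
# Bit positions shared by the status-word flags and control-word masks.
X87_EXCEPTION_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5}


def flagged(status_word: int) -> list:
    """Names of the exception flags that are set in the status word."""
    return [name for name, bit in X87_EXCEPTION_BITS.items()
            if (status_word >> bit) & 1]


def unmasked(status_word: int, control_word: int) -> list:
    """Flagged exceptions whose mask bit in the control word is clear."""
    return [name for name, bit in X87_EXCEPTION_BITS.items()
            if (status_word >> bit) & 1 and not (control_word >> bit) & 1]
```

With a control word of 037FH (all six mask bits set), no flagged exception is reported as unmasked; clearing bit 2 of the control word unmasks the divide-by-zero exception.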
PROGRAMMING WITH THE X87 FPU masked floating-point exceptions, because the x87 FPU always returns a masked result to the destination operand.) When a floating-point exception is unmasked and the exception condition occurs, the x87 FPU stops further execution of the floating-point instruction and signals the exception event.
absolutely ensure that any exceptions emanating from the FSQRT instruction are handled (for example, prior to a procedure call), a WAIT instruction can be placed directly after the FSQRT instruction. Note that some floating-point instructions (non-waiting instructions) do not check for pending unmasked exceptions (see Section 8.3.11, “x87 FPU Control Instructions”). They include the FNINIT, FNSTENV, FNSAVE, FNSTSW, FNSTCW, and FNCLEX instructions.
tion handler is provided to support the floating-point exception handling mechanism used in PC systems that are running the MS-DOS or Windows* 95 operating system. The MS-DOS compatibility mode is typically used as follows to invoke the floating-point exception handler:
1. If the x87 FPU detects an unmasked floating-point exception, it sets the flag for the exception and the ES flag in the x87 FPU status word.
2.
PROGRAMMING WITH THE X87 FPU FPU can be saved with the FSTENV/FNSTENV or FSAVE/FNSAVE instructions (see Section 8.1.10, “Saving the x87 FPU’s State with FSTENV/FNSTENV and FSAVE/FNSAVE”). If the faulting floating-point instruction is followed by one or more non-floating-point instructions, it may not be useful to re-execute the faulting instruction. See Section 8.6, “x87 FPU Exception Synchronization,” for more information on synchronizing floating-point exceptions.
CHAPTER 9 PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY The Intel MMX technology was introduced into the IA-32 architecture in the Pentium II processor family and Pentium processor with MMX technology. The extensions introduced in MMX technology support a single-instruction, multiple-data (SIMD) execution model that is designed to accelerate the performance of advanced media and communications applications. This chapter describes MMX technology. 9.
• Chapter 11, “Intel® MMX™ Technology System Programming,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, describes the manner in which MMX technology is integrated into the IA-32 system programming model.
9.2 THE MMX TECHNOLOGY PROGRAMMING ENVIRONMENT
Figure 9-1 shows the execution environment for MMX technology.
9.2.2 MMX Registers
The MMX register set consists of eight 64-bit registers (see Figure 9-2) that are used to perform calculations on the MMX packed integer data types. Values in MMX registers have the same format as a 64-bit quantity in memory. The MMX registers have two data access modes: 64-bit access mode and 32-bit access mode.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY 9.2.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY operations on byte, word, and doubleword data elements when contained in MMX registers. The SIMD execution model supported in the MMX technology directly addresses the needs of modern media, communications, and graphics applications, which often use sophisticated algorithms that perform the same operations on a large number of small data types (bytes, words, and doublewords). For example, most audio data is represented in 16-bit (word) quantities.
7FFFH, which is the largest positive integer that can be represented in 16 bits; if negative overflow occurs, the result is saturated to 8000H.
• Unsigned saturation arithmetic — With unsigned saturation arithmetic, out-of-range results are limited to the representable range of unsigned integers for the integer size. So, positive overflow when operating on unsigned byte integers results in FFH being returned and negative overflow results in 00H being returned.
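The difference between wraparound and saturation arithmetic can be shown with a short Python model of a single data element. The helpers `wrap` and `saturate` are hypothetical names; they model the result of one element only, not a complete packed instruction.

```python
def wrap(value: int, bits: int, signed: bool) -> int:
    """Wraparound result: keep only the low-order bits of the true result."""
    value &= (1 << bits) - 1
    if signed and value >> (bits - 1):
        value -= 1 << bits          # reinterpret the top bit as the sign
    return value


def saturate(value: int, bits: int, signed: bool) -> int:
    """Saturated result: clamp the true result to the representable range."""
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    return max(low, min(high, value))
```

Adding 1 to the signed word 7FFFH wraps around to 8000H (that is, -32768) but saturates to 7FFFH; adding 1 to the unsigned byte FFH wraps around to 00H but saturates to FFH.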
NOTES
The MMX instructions described in this chapter are those instructions that are available in an IA-32 processor when CPUID.01H:EDX.MMX[bit 23] = 1. Section 10.4.4, “SSE 64-Bit SIMD Integer Instructions,” and Section 11.4.2, “SSE2 64-Bit and 128-Bit SIMD Integer Instructions,” list additional instructions included with SSE/SSE2 extensions that operate on the MMX registers but are not considered part of the MMX instruction set. Table 9-2.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY Table 9-2. MMX Instruction Set Summary (Contd.) Category Wraparound Signed Saturation Doubleword Transfers Data Transfer Empty MMX State 9.4.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY The PMULHW (multiply packed signed integers and store high result) and PMULLW (multiply packed signed integers and store low result) instructions perform a signed multiply of the corresponding words of the source and destination operands and write the high-order or low-order 16 bits of each of the results, respectively, to the destination operand.
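The split between the high-order and low-order halves of each product can be modeled in Python. The functions `pmulhw` and `pmullw` below are illustrative models that take lists of signed 16-bit values and return the stored 16-bit patterns as unsigned integers; the list representation is an assumption of this sketch.

```python
def pmulhw(dest: list, src: list) -> list:
    """High-order 16 bits of each signed 16 x 16-bit product."""
    return [((a * b) >> 16) & 0xFFFF for a, b in zip(dest, src)]


def pmullw(dest: list, src: list) -> list:
    """Low-order 16 bits of each signed 16 x 16-bit product."""
    return [(a * b) & 0xFFFF for a, b in zip(dest, src)]


# Four signed words per operand, as in a 64-bit MMX register.
x = [0x4000, -2, 300, -32768]
y = [4, 3, 300, -32768]
high_words = pmulhw(x, y)
low_words = pmullw(x, y)
```

Python's right shift on a negative integer is arithmetic, so masking with FFFFH yields the two's-complement bit pattern of the high word. Combining the two results element by element reconstructs each full 32-bit product.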
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY 9.4.6 Logical Instructions PAND (bitwise logical AND), PANDN (bitwise logical AND NOT), POR (bitwise logical OR), and PXOR (bitwise logical exclusive OR) perform bitwise logical operations on the quadword source and destination operands. 9.4.7 Shift Instructions The logical shift left, logical shift right and arithmetic shift right instructions shift each element by a specified number of bit positions.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY 9.5.1 MMX Instructions and the x87 FPU Tag Word After each MMX instruction, the entire x87 FPU tag word is set to valid (00B). The EMMS instruction (empty MMX state) sets the entire x87 FPU tag word to empty (11B). Chapter 11, “Intel® MMX™ Technology System Programming,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, provides additional information about the effects of x87 FPU and MMX instructions on the x87 FPU tag word.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY 9.6.2 Transitions Between x87 FPU and MMX Code Applications can contain both x87 FPU floating-point and MMX instructions. However, because the MMX registers are aliased to the x87 FPU register stack, care must be taken when making transitions between x87 FPU instructions and MMX instructions to prevent incoherent or unexpected results.
• When an application using MMX instructions calls an x87 FPU floating-point library/DLL (use the EMMS instruction before calling the x87 FPU code).
• When a switch is made between MMX code in a task or thread and other tasks or threads in cooperative operating systems, unless it is certain that more MMX instructions will be executed before any x87 FPU code.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY 9.6.6 Using MMX Code in a Multitasking Operating System Environment An application needs to identify the nature of the multitasking operating system on which it runs. Each task retains its own state which must be saved when a task switch occurs. The processor state (context) consists of the general-purpose registers and the floating-point and MMX registers.
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY 9.6.9 Effect of Instruction Prefixes on MMX Instructions Table 9-3 describes the effect of instruction prefixes on MMX instructions. Unpredictable behavior can range from being treated as a reserved operation on one generation of IA-32 processors to generating an invalid opcode exception on another generation of processors. Table 9-3.
CHAPTER 10 PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The streaming SIMD extensions (SSE) were introduced into the IA-32 architecture in the Pentium III processor family. These extensions enhance the performance of IA-32 processors for advanced 2-D and 3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing. This chapter describes SSE.
• Instructions that support explicit prefetching of data, control of the cacheability of data, and control of the ordering of store operations.
• Extensions to the CPUID instruction.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) 10.2 SSE PROGRAMMING ENVIRONMENT Figure 10-1 shows the execution environment for the SSE extensions. All SSE instructions operate on the XMM registers, MMX registers, and/or memory as follows: • XMM registers — These eight registers (see Figure 10-2 and Section 10.2.2, “XMM Registers”) are used to operate on packed or scalar single-precision floating-point data.
SSE instructions and are referenced as EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP.
• EFLAGS register — This 32-bit register (see Figure 3-8) is used to record the results of some compare operations.
10.2.1 SSE in 64-Bit Mode and Compatibility Mode
In compatibility mode, SSE extensions function as they do in protected mode. In 64-bit mode, eight additional XMM registers are accessible. Registers XMM8-XMM15 are accessed by using REX prefixes.
integer operands (see Section 11.2, “SSE2 Programming Environment,” and Section 12.1, “SSE3/SSSE3 Programming Environment and Data types”). XMM registers can only be used to perform calculations on data; they cannot be used to address memory. Addressing memory is accomplished by using the general-purpose registers. Data can be loaded into XMM registers or written from the registers to memory in 32-bit, 64-bit, and 128-bit increments.
MXCSR register bit layout:
Bits 31-16: Reserved
Bit 15: FZ, Flush to Zero
Bits 14-13: RC, Rounding Control
Bit 12: PM, Precision Mask
Bit 11: UM, Underflow Mask
Bit 10: OM, Overflow Mask
Bit 9: ZM, Divide-by-Zero Mask
Bit 8: DM, Denormal Operation Mask
Bit 7: IM, Invalid Operation Mask
Bit 6: DAZ, Denormals Are Zeros*
Bit 5: PE, Precision Flag
Bit 4: UE, Underflow Flag
Bit 3: OE, Overflow Flag
Bit 2: ZE, Divide-by-Zero Flag
Bit 1: DE, Denormal Flag
Bit 0: IE, Invalid Operation Flag
* The denormals-are-zeros flag was introduced in the Pentium 4 and Intel Xeon
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) “Rounding,” for a description of the function and encoding of the rounding control bits. 10.2.3.3 Flush-To-Zero Bit 15 (FZ) of the MXCSR register enables the flush-to-zero mode, which controls the masked response to a SIMD floating-point underflow condition.
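The MXCSR fields can be separated with shifts and masks. The Python sketch below assumes the layout used in this chapter: exception flags in bits 0 through 5, the denormals-are-zeros flag in bit 6, exception masks in bits 7 through 12, rounding control in bits 13 and 14, and flush-to-zero in bit 15. The function `decode_mxcsr` and its dictionary keys are hypothetical, and the value 1F80H is used only as a sample with all masks set and all flags clear.

```python
MXCSR_FLAGS = ("IE", "DE", "ZE", "OE", "UE", "PE")  # bits 0 through 5
MXCSR_MASKS = ("IM", "DM", "ZM", "OM", "UM", "PM")  # bits 7 through 12


def decode_mxcsr(value: int) -> dict:
    """Split a 32-bit MXCSR value into its documented fields."""
    return {
        "flags": [name for i, name in enumerate(MXCSR_FLAGS)
                  if (value >> i) & 1],
        "daz": bool((value >> 6) & 1),
        "masks": [name for i, name in enumerate(MXCSR_MASKS)
                  if (value >> (7 + i)) & 1],
        "rounding_control": (value >> 13) & 0b11,  # raw 2-bit RC field
        "fz": bool((value >> 15) & 1),
    }


sample = decode_mxcsr(0x1F80)
```

The rounding control field is returned as a raw 2-bit value; its encoding is described in the section on rounding referenced above.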
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) Section 11.6.3, “Checking for the DAZ Flag in the MXCSR Register,” for instructions for detecting the availability of this feature. Attempting to set bit 6 of the MXCSR register on processors that do not support the DAZ flag will cause a general-protection exception (#GP). See Section 11.6.
a scalar single-precision floating-point value into a doubleword integer (see Figure 11-8). SSE extensions provide conversion instructions between XMM registers and MMX registers, and between XMM registers and general-purpose registers. See Figure 11-8.
The address of a 128-bit packed memory operand must be aligned on a 16-byte boundary, except in the following cases:
• The MOVUPS instruction supports unaligned accesses.
Source operands: X3 X2 X1 X0 and Y3 Y2 Y1 Y0; result: X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0
Figure 10-5. Packed Single-Precision Floating-Point Operation
The scalar single-precision floating-point instructions operate on the low (least significant) doublewords of the two source operands (X0 and Y0); see Figure 10-6. The three most significant doublewords (X1, X2, and X3) of the first source operand are passed through to the destination.
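The packed and scalar forms differ only in which elements the operation touches. The Python sketch below represents an XMM register as a list of four values with index 0 holding X0, the least significant doubleword. The helpers `packed_op` and `scalar_op` are hypothetical, and Python floats are used for brevity, so single-precision rounding is not modeled.

```python
from operator import add


def packed_op(op, x: list, y: list) -> list:
    """Apply op to every pair of corresponding elements."""
    return [op(a, b) for a, b in zip(x, y)]


def scalar_op(op, x: list, y: list) -> list:
    """Apply op to the low elements only; X1..X3 pass through unchanged."""
    return [op(x[0], y[0])] + list(x[1:])


x = [1.0, 2.0, 3.0, 4.0]      # X0, X1, X2, X3
y = [10.0, 20.0, 30.0, 40.0]  # Y0, Y1, Y2, Y3
packed_sum = packed_op(add, x, y)
scalar_sum = scalar_op(add, x, y)
```

Here `packed_sum` corresponds to the behavior of a packed add and `scalar_sum` to a scalar add, in which only element 0 changes.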
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The MOVAPS (move aligned packed single-precision floating-point values) instruction transfers a double quadword operand containing four packed single-precision floating-point values from memory to an XMM register and vice versa, or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) tively, the low single-precision floating-point values of two operands and store the result in the low doubleword of the destination operand. The MULPS (multiply packed single-precision floating-point values) instruction multiplies two packed single-precision floating-point operands.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The MINPS (return minimum of packed single-precision floating-point values) instruction compares the corresponding values from two packed single-precision floating-point operands and returns the numerically lesser value from each comparison to the destination operand.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The COMISS (compare scalar single-precision floating-point values and set EFLAGS) and UCOMISS (unordered compare scalar single-precision floating-point values and set EFLAGS) instructions compare the low values of two packed single-precision floating-point operands and set the ZF, PF, and CF flags in the EFLAGS register to show the result (greater than, less than, equal, or unordered).
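The four possible comparison results map onto the three flags as sketched below in Python. The flag encodings used here (greater: all clear; less: CF set; equal: ZF set; unordered: ZF, PF, and CF all set) are an assumption taken from the instruction reference rather than from this section, and `comiss_flags` is a hypothetical name. The sketch does not model the difference in exception signaling between COMISS and UCOMISS.

```python
import math


def comiss_flags(a: float, b: float) -> tuple:
    """Return (ZF, PF, CF) for a comparison of the low values a and b."""
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)   # unordered
    if a > b:
        return (0, 0, 0)   # greater than
    if a < b:
        return (0, 0, 1)   # less than
    return (1, 0, 0)       # equal


def is_unordered(flags: tuple) -> bool:
    """PF distinguishes an unordered result from 'equal' and 'less than'."""
    return flags[1] == 1
```

Because an unordered result also sets ZF and CF, code that branches on these flags typically checks PF before testing for equal or less than.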
DEST: X3 X2 X1 X0
SRC: Y3 Y2 Y1 Y0
DEST (result): Y3 X3 Y2 X2
Figure 10-8. UNPCKHPS Instruction, High Unpack and Interleave Operation
The UNPCKLPS (unpack and interleave low packed single-precision floating-point values) instruction performs an interleaved unpack of the low-order single-precision floating-point values from the source and destination operands and stores the result in the destination operand (see Figure 10-9).
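The two interleave patterns can be written out directly. In the Python sketch below, each operand is a list of four elements with index 0 holding the least significant value, so the figure's left-to-right order (most significant first) appears reversed. The function names mirror the instruction mnemonics but are only illustrative models.

```python
def unpckhps(dest: list, src: list) -> list:
    """Interleave the two high elements: result is X2, Y2, X3, Y3."""
    return [dest[2], src[2], dest[3], src[3]]


def unpcklps(dest: list, src: list) -> list:
    """Interleave the two low elements: result is X0, Y0, X1, Y1."""
    return [dest[0], src[0], dest[1], src[1]]


x = ["X0", "X1", "X2", "X3"]
y = ["Y0", "Y1", "Y2", "Y3"]
high = unpckhps(x, y)
low = unpcklps(x, y)
```

Reading `high` from the most significant element down gives Y3 X3 Y2 X2, which matches the result shown for Figure 10-8.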
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The CVTSI2SS (convert doubleword integer to scalar single-precision floating-point value) instruction converts a signed doubleword integer into a single-precision floating-point value. When the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The PMAXUB (maximum of packed unsigned byte integers) instruction compares the corresponding unsigned byte integers in two packed operands and returns the greater of each comparison to the destination operand. The PMINUB (minimum of packed unsigned byte integers) instruction compares the corresponding unsigned byte integers in two packed operands and returns the lesser of each comparison to the destination operand.
10.4.6.1 Cacheability Control Instructions
The following three instructions enable data from the MMX and XMM registers to be stored to memory using a non-temporal hint. The non-temporal hint directs the processor to store the data to memory, when possible, without writing the data into the cache hierarchy. See Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data,” for information about non-temporal stores and hints.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in uncacheable memory. Uncacheable as referred to here means that the region being written to has been mapped with either an uncacheable (UC) or write protected (WP) memory type.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) Table 10-1.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE) NOTE The FXSAVE and FXRSTOR instructions are not considered part of the SSE instruction group. They have a separate CPUID feature bit to indicate whether they are present (if CPUID.01H:EDX.FXSR[bit 24] = 1). The CPUID feature bit for SSE extensions does not indicate the presence of FXSAVE and FXRSTOR. 10.6 HANDLING SSE INSTRUCTION EXCEPTIONS See Section 11.
CHAPTER 11 PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) The streaming SIMD extensions 2 (SSE2) were introduced into the IA-32 architecture in the Pentium 4 and Intel Xeon processors. These extensions enhance the performance of IA-32 processors for advanced 3-D graphics, video decoding/encoding, speech recognition, E-commerce, Internet, scientific, and engineering applications.
• Modifications to existing IA-32 instructions to support SSE2 features:
— Extensions and modifications to the CPUID instruction
— Modifications to the RDPMC instruction
These new features extend the IA-32 architecture’s SIMD programming model in three important ways:
• They provide the ability to perform SIMD operations on pairs of packed double-precision floating-point values.
• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating the SSE and SSE2 extensions into an operating-system environment.
11.2 SSE2 PROGRAMMING ENVIRONMENT
Figure 11-1 shows the programming environment for SSE2 extensions. No new registers or other instruction execution state are defined with SSE2 extensions.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) more information on the functions of these flags see Section 10.2.3.4, “Denormals-Are-Zeros,” and Section 10.2.3.3, “Flush-To-Zero.” • MMX registers — These eight registers (see Figure 9-2) are used to perform operations on 64-bit packed integer data. They are also used to hold operands for some operations performed between MMX and XMM registers. MMX registers are referenced by the names MM0 through MM7.
11.2.3 Denormals-Are-Zeros Flag
The denormals-are-zeros flag (bit 6 in the MXCSR register) was introduced into the IA-32 architecture with the SSE2 extensions. See Section 10.2.3.4, “Denormals-Are-Zeros,” for a description of this flag.
11.3 SSE2 DATA TYPES
SSE2 extensions introduced one 128-bit packed floating-point data type and four 128-bit SIMD integer data types to the IA-32 architecture (see Figure 11-2).
The address of a 128-bit packed memory operand must be aligned on a 16-byte boundary, except in the following cases:
• a MOVUPD instruction which supports unaligned accesses
• scalar instructions that use an 8-byte memory operand that is not subject to alignment requirements
Figure 4-2 shows the byte order of 128-bit (double quadword) and 64-bit (quadword) data types in memory. 11.
Source operands: X1 X0 and Y1 Y0; result: X1 OP Y1, X0 OP Y0
Figure 11-3. Packed Double-Precision Floating-Point Operations
The scalar double-precision floating-point instructions operate on the low (least significant) quadwords of two source operands (X0 and Y0), as shown in Figure 11-4. The high quadword (X1) of the first source operand is passed through to the destination.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) 11.4.1.1 Data Movement Instructions Data movement instructions move double-precision floating-point data between XMM registers and between XMM registers and memory. The MOVAPD (move aligned packed double-precision floating-point) instruction transfers a 128-bit packed double-precision floating-point operand from memory to an XMM register or vice versa, or between XMM registers.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) The MULPD (multiply packed double-precision floating-point values) instruction multiplies two packed double-precision floating-point operands. The MULSD (multiply scalar double-precision floating-point values) instruction multiplies the low double-precision floating-point values of two operands and stores the result in the low quadword of the destination operand.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) The ANDNPD (bitwise logical AND NOT of packed double-precision floating-point values) instruction returns the logical AND NOT of two packed double-precision floating-point operands. The ORPD (bitwise logical OR of packed double-precision floating-point values) instruction returns the logical OR of two packed double-precision floating-point operands.
the two packed double-precision floating-point values from the source operand in the high quadword of the destination operand (see Figure 11-5). By using the same register for the source and destination operands, the SHUFPD instruction can swap two packed double-precision floating-point values.
DEST: X1 X0
SRC: Y1 Y0
DEST (result): Y1 or Y0, X1 or X0
Figure 11-5.
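The selection performed by SHUFPD can be modeled with two bits of the immediate operand. The Python sketch below assumes that bit 0 of the immediate selects which destination value goes to the low quadword and bit 1 selects which source value goes to the high quadword. Each operand is a two-element list with index 0 holding the low quadword, and `shufpd` is an illustrative model rather than the instruction's full definition.

```python
def shufpd(dest: list, src: list, imm8: int) -> list:
    """Low result from dest (imm8 bit 0), high result from src (imm8 bit 1)."""
    low = dest[imm8 & 1]
    high = src[(imm8 >> 1) & 1]
    return [low, high]


x = ["X0", "X1"]
y = ["Y0", "Y1"]
mixed = shufpd(x, y, 0b10)    # low = X0, high = Y1
swapped = shufpd(x, x, 0b01)  # same register for both operands: swap halves
```

With the same register as source and destination and an immediate of 01B, the low and high values exchange places, which is the swap described in the text.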
DEST: X1 X0
SRC: Y1 Y0
DEST (result): Y0 X0
Figure 11-7. UNPCKLPD Instruction, Low Unpack and Interleave Operation
11.4.1.
Conversion diagram (graphic not reproduced). Operand types shown: Single-Precision Floating Point (XMM/mem); 2 Doubleword Integer (XMM/mem); 4 Doubleword Integer (XMM/mem); 2 Doubleword Integer (MMX/mem); Doubleword Integer (r32/mem). Legible instruction labels include CVTSD2SS, CVTPD2PS, CVTSS2SD, and CVTPS2PD.
The CVTPD2DQ (convert packed double-precision floating-point values to packed doubleword integers) instruction converts two packed double-precision floating-point numbers to two packed signed doubleword integers, with the result stored in the low quadword of an XMM register. When rounding to an integer value, the source value is rounded according to the rounding mode selected in the MXCSR register.
operands in XMM registers or memory (the latter for at most one source operand). When the conversion is inexact, the value rounded according to the rounding mode selected in the MXCSR register is returned.
11.4.2 SSE2 64-Bit and 128-Bit SIMD Integer Instructions
SSE2 extensions add several 128-bit packed integer instructions to the IA-32 architecture. Where appropriate, a 64-bit version of each of these instructions is also provided.
The PSHUFD (shuffle packed doubleword integers) instruction shuffles the doubleword integers packed into the source operand and stores the shuffled result in the destination operand. An 8-bit immediate operand specifies the shuffle order. The PSLLDQ (shift double quadword left logical) instruction shifts the contents of the source operand to the left by the number of bytes specified by an immediate operand.
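Both operations can be modeled in a few lines of Python. The sketch assumes that each 2-bit field of the PSHUFD immediate selects the source doubleword for one destination position, starting with the lowest position, and that PSLLDQ clears the result when the byte count exceeds 15. The functions `pshufd` and `pslldq` are illustrative models; the 128-bit operand of `pslldq` is represented as a Python integer.

```python
MASK_128 = (1 << 128) - 1


def pshufd(src: list, imm8: int) -> list:
    """Result element i is src[field i], where field i = imm8 bits 2i+1:2i."""
    return [src[(imm8 >> (2 * i)) & 0b11] for i in range(4)]


def pslldq(value: int, count: int) -> int:
    """Shift a 128-bit value left by count bytes, filling with zeros."""
    if count > 15:
        return 0
    return (value << (8 * count)) & MASK_128


dwords = [10, 11, 12, 13]          # element 0 is the low doubleword
reversed_dwords = pshufd(dwords, 0x1B)
identity = pshufd(dwords, 0xE4)
```

An immediate of 1BH reverses the four doublewords and E4H leaves them in place; a byte shift that moves a value past bit 127 discards it.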
11.4.4.1 FLUSH Cache Line
The CLFLUSH (flush cache line) instruction writes back and invalidates the cache line associated with a specified linear address. The invalidation is for all levels of the processor’s cache hierarchy, and it is broadcast throughout the cache coherency domain.
NOTE
CLFLUSH was introduced with the SSE2 extensions. However, the instruction can be implemented in IA-32 processors that do not implement the SSE2 extensions.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) 11.4.4.4 Pause The PAUSE instruction is provided to improve the performance of “spin-wait loops” executed on a Pentium 4 or Intel Xeon processor. On a Pentium 4 processor, it also provides the added benefit of reducing processor power consumption while executing a spin-wait loop. It is recommended that a PAUSE instruction always be included in the code sequence for a spin-wait loop. 11.4.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) 11.5.1 SIMD Floating-Point Exceptions SIMD floating-point exceptions are those exceptions that can be generated by SSE/SSE2/SSE3 instructions that operate on packed or scalar floating-point operands.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) 11.5.2.1 Invalid Operation Exception (#I) The floating-point invalid-operation exception (#I) occurs in response to an invalid arithmetic operand. The flag (IE) and mask (IM) bits for the invalid operation exception are bits 0 and 7, respectively, in the MXCSR register.
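The flag and mask bit positions can be tested on a saved MXCSR value as in the following C sketch. The bit positions for the invalid-operation exception come from this section and those for the divide-by-zero exception from Section 11.5.2.3; the macro and function names are hypothetical, and the MXCSR value is assumed to have been stored to memory beforehand (for example, with STMXCSR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as given in Sections 11.5.2.1 and 11.5.2.3. */
#define MXCSR_IE (1u << 0)   /* invalid-operation flag */
#define MXCSR_ZE (1u << 2)   /* divide-by-zero flag    */
#define MXCSR_IM (1u << 7)   /* invalid-operation mask */
#define MXCSR_ZM (1u << 9)   /* divide-by-zero mask    */

/* True when the given exception flag is set in an MXCSR image. */
static bool mxcsr_flag_set(uint32_t mxcsr, uint32_t flag_bit)
{
    return (mxcsr & flag_bit) != 0;
}

/* True when the given exception is masked in an MXCSR image. */
static bool mxcsr_masked(uint32_t mxcsr, uint32_t mask_bit)
{
    assert(mask_bit == MXCSR_IM || mask_bit == MXCSR_ZM);
    return (mxcsr & mask_bit) != 0;
}
```

Applied to the power-up value 1F80H listed in Table 11-2, both masks test as set and neither flag tests as set.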
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) Table 11-1. Masked Responses of SSE/SSE2/SSE3 Instructions to Invalid Arithmetic Operations (Contd.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) 11.5.2.3 Divide-By-Zero Exception (#Z) The processor reports a divide-by-zero exception when a DIVPS, DIVSS, DIVPD or DIVSD instruction attempts to divide a finite non-zero operand by 0. The flag (ZE) and mask (ZM) bits for the divide-by-zero exception are bits 2 and 9, respectively, in the MXCSR register. See Section 4.9.1.3, “Divide-By-Zero Exception (#Z),” for more information about the divide-by-zero exception. See Section 11.5.
To-Zero”). The numeric underflow exception is not affected by the denormals-are-zero mode. See Section 4.9.1.5, “Numeric Underflow Exception (#U),” for more information about the numeric underflow exception. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software,” for information on handling unmasked exceptions. 11.5.2.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) and continuing program execution. The masked result may be a rounded normalized value, signed infinity, a denormal finite number, zero, a QNaN floating-point indefinite, or a QNaN depending on the exception condition detected. In most cases, the corresponding exception flag bit in MXCSR is also set. The one situation where an exception flag is not set is when an underflow condition is detected and it is not accompanied by an inexact result.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) numeric underflow, inexact result, and numeric overflow are OR’d and the corresponding flags are set in the MXCSR register.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) would be generated because the multiplications of X0 and Y0 and of X1 and Y1 are exact). 11.5.3.3 Handling Combinations of Masked and Unmasked Exceptions In situations where both masked and unmasked exceptions are detected, the processor will set exception flags for the masked and the unmasked exceptions.
• An application that expects to detect x87 FPU exceptions that occur during the execution of x87 FPU instructions will not be notified if exceptions occur during the execution of corresponding SSE/SSE2/SSE3 instructions, unless the exception masks that are enabled in the x87 FPU control word have also been enabled in the MXCSR register and the application is capable of handling SIMD floating-point exceptions (#XF).
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) • Use stack and data alignment techniques to keep data properly aligned for efficient memory use. • Use the non-temporal store instructions offered with the SSE and SSE2 extensions. • Employ the optimization and scheduling techniques described in the Intel Pentium 4 Optimization Reference Manual (see Section 1.4, “Related Literature,” for the order number for this manual). 11.6.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) early steppings. To check for the presence of the DAZ flag in the MXCSR register, do the following: 1. Establish a 512-byte FXSAVE area in memory. 2. Clear the FXSAVE area to all 0s. 3. Execute the FXSAVE instruction, using the address of the first byte of the cleared FXSAVE area as a source operand.
Table 11-2. SSE and SSE2 State Following a Power-up/Reset or INIT

Registers            Power-Up or Reset    INIT
XMM0 through XMM7    +0.0                 Unchanged
MXCSR                1F80H                Unchanged

If the processor is reset by asserting the INIT# pin, the SSE and SSE2 state is not changed. 11.6.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) Developer’s Manual, Volume 2A, for a description of FXSAVE and the layout of the FXSAVE image. 4. Check the value in the MXCSR_MASK field in the FXSAVE image (bytes 28 through 31). — If the value of the MXCSR_MASK field is 00000000H, then the MXCSR_MASK value is the default value of 0000FFBFH.
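The MXCSR_MASK check in step 4 can be sketched in C as follows. The sketch assumes the 512-byte image has already been filled by FXSAVE elsewhere and that the field is stored in little-endian byte order; the function names are hypothetical. DAZ is taken to be bit 6 of MXCSR, so a set bit 6 in the mask is read as "DAZ supported".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MXCSR_MASK_DEFAULT 0x0000FFBFu   /* used when the field reads 00000000H */
#define MXCSR_DAZ (1u << 6)              /* denormals-are-zero bit in MXCSR */

/* Read MXCSR_MASK from bytes 28 through 31 of an FXSAVE image. */
static uint32_t mxcsr_mask_from_image(const uint8_t image[512])
{
    assert(image);
    uint32_t m = (uint32_t)image[28]
               | ((uint32_t)image[29] << 8)
               | ((uint32_t)image[30] << 16)
               | ((uint32_t)image[31] << 24);
    return m ? m : MXCSR_MASK_DEFAULT;
}

/* DAZ is usable only when its bit is set in the effective mask. */
static bool daz_supported(const uint8_t image[512])
{
    return (mxcsr_mask_from_image(image) & MXCSR_DAZ) != 0;
}
```

Because the default mask 0000FFBFH has bit 6 clear, an image whose MXCSR_MASK field is zero reports DAZ as unsupported.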
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) majority of its floating-point computations in the XMM registers, using the packed and scalar floating-point instructions, and at the same time use the x87 FPU to perform trigonometric and other transcendental computations. Likewise, an application can perform packed 64-bit and 128-bit SIMD integer operations together without restrictions.
arithmetic operation on the data in an XMM register, it does not check that the data being operated on matches the data type specified in the instruction. As a general rule, because data typing of SIMD floating-point and integer data types is not enforced at the architectural level, it is the responsibility of the programmer, assembler, or compiler to ensure that code enforces data typing.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However, if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in the register. Note that these latency penalties are not incurred when moving data from XMM registers to memory. 11.6.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) Use the LDMXCSR and STMXCSR instructions to save and restore, respectively, the contents of the MXCSR register on a procedure call and return. 11.6.10.3 Caller-Save Requirement for Procedure and Function Calls When making procedure (or function) calls from SSE or SSE2 code, a caller-save convention is recommended for saving the state of the calling procedure.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) • Use of the 64-bit shift by bit instructions (PSRLQ, PSLLQ) can be extended to 128 bits in either of two ways: — Use of PSRLQ and PSLLQ, along with masking logic operations. — Rewriting the code sequence to use PSRLDQ and PSLLDQ (shift double quadword operand by bytes) • Loop counters need to be updated, since each 128-bit SIMD integer instruction operates on twice the amount of data as its 64-bit SIMD integer counterpart. 11.6.
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2) Temporal Data,” and Section 10.4.6.1, “Cacheability Control Instructions”). They prevent non-temporal data from being written into processor caches on a store operation. These instructions are implementation specific. Programmers may have to tune their applications for each IA-32 processor implementation to take advantage of these instructions.
Table 11-3. Effect of Prefixes on SSE, SSE2, and SSE3 Instructions

Prefix Type                                   Effect on SSE, SSE2 and SSE3 Instructions
Address Size Prefix (67H)                     Affects instructions with a memory operand. Reserved for instructions without a memory operand and may result in unpredictable behavior.
Operand Size (66H)                            Reserved and may result in unpredictable behavior.
Segment Override (2EH,36H,3EH,26H,64H,65H)    Affects instructions with a memory operand.
CHAPTER 12 PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 The Pentium 4 processor supporting Hyper-Threading Technology introduces Streaming SIMD Extensions 3 (SSE3). The Intel Xeon processor 5100 series and the Intel Core 2 processor families introduced Supplemental Streaming SIMD Extensions 3 (SSSE3). This chapter describes SSE3/SSSE3 and provides information to assist in writing application programs that use these extensions. 12.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 12.1.2 Compatibility of SSE3/SSSE3 with MMX Technology, the x87 FPU Environment, and SSE/SSE2 Extensions SSE3/SSSE3 do not introduce any new state to the Intel 64 and IA-32 execution environments. For SIMD and x87 programming, the FXSAVE and FXRSTOR instructions save and restore the architectural states of XMM, MXCSR, x87 FPU, and MMX registers.
[Figure 12-2. Horizontal Data Movement in HADDPD: the operands hold X1, X0 and Y1, Y0; two ADD operations produce the result Y0 + Y1, X0 + X1.]

12.2 OVERVIEW OF SSE3 INSTRUCTIONS SSE3 extensions include 13 instructions. See:
• Section 12.3, “SSE3 Instructions,” provides an introduction to individual SSE3 instructions.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide detailed information on individual instructions.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 • Thread synchronization instructions — Two instructions that improve synchronization between multi-threaded agents The instructions are discussed in more detail in the following paragraphs. 12.3.1 x87 FPU Instruction for Integer Conversion The FISTTP instruction (x87 FPU Store Integer and Pop with Truncation) behaves like FISTP, but uses truncation regardless of what rounding mode is specified in the x87 FPU control word.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b — Result (stored in OperandA): 2b, 2b, 0b, 0b The MOVDDUP instruction loads/moves 64-bits; duplicating the 64 bits from the source. • MOVDDUP OperandA, OperandB — OperandA (128 bits, two data elements): 1a, 0a — OperandB (64 bits, one data element): 0b — Result (stored in OperandA): 0b, 0b 12.3.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 elements of the second operand; and the fourth by adding the third and fourth elements of the second operand. • HADDPS OperandA, OperandB — OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b — Result (Stored in OperandA): 3b+2b, 1b+0b, 3a+2a, 1a+0a The HSUBPS instruction performs a single-precision subtraction on contiguous data elements.
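The HADDPS result layout listed above can be checked with a small scalar model in C. This is a sketch of the data movement only; the function name `haddps_model` is hypothetical, and array index 0 is assumed to correspond to data element 0 (the lowest element).

```c
#include <assert.h>

/* Scalar model of HADDPS OperandA, OperandB (hypothetical helper).
   Result elements, from 3 down to 0: 3b+2b, 1b+0b, 3a+2a, 1a+0a. */
static void haddps_model(float a[4], const float b[4])
{
    assert(a != 0 && b != 0);
    float r[4] = {
        a[1] + a[0],   /* element 0: 1a + 0a */
        a[3] + a[2],   /* element 1: 3a + 2a */
        b[1] + b[0],   /* element 2: 1b + 0b */
        b[3] + b[2]    /* element 3: 3b + 2b */
    };
    for (int i = 0; i < 4; i++)
        a[i] = r[i];   /* the result is stored in OperandA */
}
```

The temporary array matters: all four sums are formed from the original operand values before OperandA is overwritten.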
12.3.6 Two Thread Synchronization Instructions The MONITOR instruction sets up an address range that is used to monitor write-back stores. MWAIT enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by MONITOR. MONITOR and MWAIT require the use of general-purpose registers for their inputs.
application attempts to use the MONITOR and MWAIT instructions, the application should use the following steps: 1. Check that the processor supports MONITOR and MWAIT. If CPUID.01H:ECX.MONITOR[bit 3] = 1, MONITOR and MWAIT are available at ring 0. 2. To verify that MONITOR and MWAIT are supported at a ring level greater than 0, use a routine similar to Example 12-2. 3. Query the smallest and largest line size that MONITOR uses. Use CPUID.05H:EAX.smallest[bits 15:0];EBX.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 // if we get here, MONITOR/MWAIT is not available MONITOR_MWAIT_works = FALSE; } 12.4.3 Enable FTZ and DAZ for SIMD Floating-Point Computation Enabling the FTZ and DAZ flags in the MXCSR register is likely to accelerate SIMD floating-point computation where strict compliance to the IEEE standard 754-1985 is not required.
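Computing an MXCSR value with FTZ and DAZ enabled can be sketched in C as follows. The bit positions used here (FTZ at bit 15, DAZ at bit 6) are not stated in this passage and are taken from the MXCSR register layout; the function name is hypothetical. The sketch only computes the new value, which would then be loaded with LDMXCSR, and it uses the MXCSR_MASK value so that an unsupported DAZ bit is never set.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_DAZ (1u << 6)    /* denormals-are-zero (assumed bit position) */
#define MXCSR_FTZ (1u << 15)   /* flush-to-zero (assumed bit position) */

/* Return `mxcsr` with FTZ and DAZ set wherever `mxcsr_mask` permits it. */
static uint32_t mxcsr_with_ftz_daz(uint32_t mxcsr, uint32_t mxcsr_mask)
{
    uint32_t want = MXCSR_FTZ | MXCSR_DAZ;
    assert((mxcsr & ~mxcsr_mask) == 0);   /* reserved bits must already be clear */
    return mxcsr | (want & mxcsr_mask);
}
```

Starting from the power-up value 1F80H, a mask of 0000FFFFH yields 9FC0H, while the default mask 0000FFBFH yields 9F80H because DAZ is not available.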
12.6 SSSE3 INSTRUCTIONS SSSE3 instructions include:
• Twelve instructions that perform horizontal addition or subtraction operations.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• Two instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 There are six horizontal add instructions (represented by three mnemonics); three operate on 128-bit operands and three operate on 64-bit operands. The width of each data element is either 16 bits or 32 bits. The mnemonics are listed below. • PHADDW adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed 16-bit results to the destination operand.
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3 12.6.3 Multiply and Add Packed Signed and Unsigned Bytes There are two multiply-and-add-packed-signed-unsigned-byte instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands. Multiplications are performed on each vertical pair of data elements. The data elements in the source operand are signed byte values, the input data elements of the destination operand are unsigned byte values.
12.6.6 Packed Sign There are six packed-sign instructions (represented by three mnemonics). Three operate on 128-bit operands and three operate on 64-bit operands. The data elements for these instructions are 8-bit, 16-bit, or 32-bit signed integers. • PSIGNB/W/D negates each signed integer element of the destination operand if the corresponding data element in the source operand is less than zero. 12.6.
12.7.2 Checking for SSSE3 Support Before an application attempts to use the SIMD subset of SSSE3 extensions, the application should follow the steps illustrated in Section 11.6.2, “Checking for SSE/SSE2 Support.” Next, use the additional step provided below: • Check that the processor supports the SSSE3 extensions (if CPUID.01H:ECX.SSSE3[bit 9] = 1). See Example 12-3 for a code example. Example 12-3.
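As a complement, the bit test itself can be sketched in portable C. The sketch assumes the ECX value returned by CPUID leaf 01H has already been obtained (for example, on GCC or Clang for x86, with `__get_cpuid` from `<cpuid.h>`); the macro and function names are hypothetical. The MONITOR bit position comes from Section 12.4.2 and the SSSE3 bit position from this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_01H_ECX_MONITOR (1u << 3)   /* CPUID.01H:ECX.MONITOR[bit 3] */
#define CPUID_01H_ECX_SSSE3   (1u << 9)   /* CPUID.01H:ECX.SSSE3[bit 9]   */

/* Test one feature flag in a previously captured CPUID.01H:ECX value. */
static bool cpuid_ecx_has(uint32_t ecx, uint32_t feature_bit)
{
    assert(feature_bit != 0);
    return (ecx & feature_bit) != 0;
}
```

Each flag is tested individually, without assumptions about neighboring or undefined bits.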
12.8.2 Numeric Error flag and IGNNE# Most SSE3 instructions ignore CR0.NE[bit 5] (treating it as if it were always set) and the IGNNE# pin. With one exception, all use the vector 19 software exception for error reporting. The exception is FISTTP; it behaves like other x87-FP instructions. SSSE3 instructions ignore CR0.NE[bit 5] (treating it as if it were always set) and the IGNNE# pin. SSSE3 instructions do not cause floating-point errors. 12.8.
CHAPTER 13 INPUT/OUTPUT In addition to transferring data to and from external memory, IA-32 processors can also transfer data to and from input/output ports (I/O ports). I/O ports are created in system hardware by circuitry that decodes the control, data, and address pins on the processor. These I/O ports are then configured to communicate with peripheral devices. An I/O port can be an input port, an output port, or a bidirectional port.
I/O address space is selected, it is the responsibility of the hardware to decode the memory-I/O bus transaction to select I/O ports rather than memory. Data is transmitted between the processor and an I/O device through the data lines. 13.3 I/O ADDRESS SPACE The processor’s I/O address space is separate and distinct from the physical-memory address space. The I/O address space consists of 2^16 (64K) individually addressable 8-bit I/O ports, numbered 0 through FFFFH.
INPUT/OUTPUT uncacheable (UC). See Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a complete discussion of the MTRRs. The Pentium and Intel486 processors do not support MTRRs. Instead, they provide the KEN# pin, which when held inactive (high) prevents caching of all addresses sent out on the system bus. To use this pin, external address decoding logic is required to block caching in specific address spaces.
INPUT/OUTPUT • Those that transfer strings of items (strings of bytes, words, or doublewords) between an I/O port and memory The register I/O instructions IN (input from I/O port) and OUT (output to I/O port) move data between I/O ports and the EAX register (32-bit I/O), the AX register (16-bit I/O), or the AL (8-bit I/O) register. The address of the I/O port can be given with an immediate value or a value in the DX register.
privilege level needed to perform I/O. In a typical protection ring model, access to the I/O address space is restricted to privilege levels 0 and 1. Here, the kernel and the device drivers are allowed to perform I/O, while less privileged device drivers and application programs are denied access to the I/O address space. Application programs must then make calls to the operating system to perform I/O.
[Figure 13-2. I/O Permission Bit Map: diagram of the Task State Segment (TSS) showing the I/O Map Base field in the doubleword at offset 64H and the I/O Permission Bit Map that it locates. The last byte of the bit map must be followed by a byte with all bits set. The I/O map base must not exceed DFFFH.]

Because each task has its own TSS, each task has its own I/O permission bit map. Access to individual I/O ports can thus be granted to individual tasks.
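The per-port check that the bit map implies can be sketched as a software model in C. This is an illustration under stated assumptions, not a description of the processor's implementation: each port is assumed to map to one bit (port n at bit n mod 8 of byte n / 8), a clear bit is assumed to permit access, and a multi-byte access is assumed to require every covered bit to be clear. The function name and the `bitmap_bytes` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an I/O permission check (hypothetical helper).
   bitmap:       first byte of the I/O permission bit map
   bitmap_bytes: bytes available, including the trailing all-ones byte
   port, width:  first port and access size in bytes (1, 2, or 4) */
static bool io_access_allowed(const uint8_t *bitmap, size_t bitmap_bytes,
                              uint16_t port, unsigned width)
{
    assert(bitmap != NULL);
    assert(width == 1 || width == 2 || width == 4);
    for (unsigned i = 0; i < width; i++) {
        uint32_t p = (uint32_t)port + i;
        size_t byte = p / 8;
        if (byte >= bitmap_bytes)
            return false;                    /* beyond the map: denied */
        if (bitmap[byte] & (1u << (p % 8)))
            return false;                    /* bit set: denied */
    }
    return true;                             /* every covered bit is clear */
}
```

In this model the trailing byte of all 1s means that an access running past the last mapped port is denied rather than read from unrelated memory.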
INPUT/OUTPUT If the I/O bit map base address is greater than or equal to the TSS segment limit, there is no I/O permission map, and all I/O instructions generate exceptions when the CPL is greater than the current IOPL. 13.6 ORDERING I/O When controlling I/O devices it is often important that memory and I/O operations be carried out in precisely the order programmed. For example, a program may write a command to an I/O port, then read the status of the I/O device from another I/O port.
INPUT/OUTPUT When the I/O address space is used instead of memory-mapped I/O, the situation is different in two respects: • The processor never buffers I/O writes. Therefore, strict ordering of I/O operations is enforced by the processor. (As with memory-mapped I/O, it is possible for a chip set to post writes in certain I/O ranges.) • The processor synchronizes I/O instruction execution with external bus activity (see Table 13-1). Table 13-1.
CHAPTER 14 PROCESSOR IDENTIFICATION AND FEATURE DETERMINATION When writing software intended to run on IA-32 processors, it is necessary to identify the type of processor present in a system and the processor features that are available to an application. 14.1 USING THE CPUID INSTRUCTION Use the CPUID instruction for processor identification in the Pentium M processor family, Pentium 4 processor family, Intel Xeon processor family, P6 family, Pentium processor, and later Intel486 processors.
PROCESSOR IDENTIFICATION AND FEATURE DETERMINATION • Test feature identification flags individually and do not make assumptions about undefined bits. 14.1.2 Identification of Earlier IA-32 Processors The CPUID instruction is not available in earlier IA-32 processors up through the earlier Intel486 processors. For these processors, several other architectural features can be exploited to identify the processor.
APPENDIX A EFLAGS CROSS-REFERENCE A.1 EFLAGS AND INSTRUCTIONS Table A-2 summarizes how the instructions affect the flags in the EFLAGS register. The following codes describe how the flags are affected.

Table A-1. Codes Describing Flags
T    Instruction tests flag.
M    Instruction modifies flag (either sets or resets depending on operands).
0    Instruction resets flag.
1    Instruction sets flag.
—    Instruction's effect on flag is undefined.
R    Instruction restores prior value of flag.
EFLAGS CROSS-REFERENCE Table A-2. EFLAGS Cross-Reference (Contd.
Table A-2. EFLAGS Cross-Reference (Contd.)

Instruction    OF   SF   ZF   AF   PF   CF   TF   IF   DF   NT   RF
UD2
VERR/VERRW               M
WAIT
WBINVD
WRMSR
XADD           M    M    M    M    M    M
XCHG
XLAT
XOR            0    M    M    —    M    0
APPENDIX B EFLAGS CONDITION CODES B.1 CONDITION CODES Table B-1 lists condition codes that can be queried using CMOVcc, FCMOVcc, Jcc, and SETcc. Condition codes refer to the setting of one or more status flags (CF, OF, SF, ZF, and PF) in the EFLAGS register. In the table below:
• The “Mnemonic” column provides the suffix (cc) added to the instruction to specify a test condition.
• “Condition Tested For” describes the targeted condition.
• “Status Flags Setting” describes the flag setting.
Table B-1. EFLAGS Condition Codes (Contd.)

Mnemonic (cc)    Condition Tested For                Instruction Subcode    Status Flags Setting
NP, PO           No parity, Parity odd               1011                   PF = 0
L, NGE           Less, Neither greater nor equal     1100                   (SF XOR OF) = 1
NL, GE           Not less, Greater or equal          1101                   (SF XOR OF) = 0
LE, NG           Less or equal, Not greater          1110                   ((SF XOR OF) OR ZF) = 1
NLE, G           Neither less nor equal, Greater     1111                   ((SF XOR OF) OR ZF) = 0

Many of the test conditions are described in two different ways.
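The signed-comparison rows of the table can be written directly as predicates over the status flags, as in the following C sketch. The `status_flags` structure and the function names are hypothetical; the Boolean expressions follow the “Status Flags Setting” column.

```c
#include <assert.h>
#include <stdbool.h>

/* Snapshot of the status flags used by the conditions (hypothetical type). */
typedef struct {
    bool cf, pf, zf, sf, of;
} status_flags;

/* NP/PO (1011): PF = 0 */
static bool cond_np(status_flags f) { return !f.pf; }

/* L/NGE (1100): (SF XOR OF) = 1 */
static bool cond_l(status_flags f)  { return f.sf != f.of; }

/* NL/GE (1101): (SF XOR OF) = 0 */
static bool cond_ge(status_flags f) { return f.sf == f.of; }

/* LE/NG (1110): ((SF XOR OF) OR ZF) = 1 */
static bool cond_le(status_flags f) { return (f.sf != f.of) || f.zf; }

/* NLE/G (1111): ((SF XOR OF) OR ZF) = 0 */
static bool cond_g(status_flags f)  { return !((f.sf != f.of) || f.zf); }
```

Each pair of adjacent subcodes is the logical complement of the other, which is why a mnemonic such as NL can also be read as GE.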
APPENDIX C FLOATING-POINT EXCEPTIONS SUMMARY C.1 OVERVIEW This appendix shows which of the floating-point exceptions can be generated for:
• x87 FPU instructions — see Table C-2
• SSE instructions — see Table C-3
• SSE2 instructions — see Table C-4
• SSE3 instructions — see Table C-5
Table C-1 lists types of floating-point exceptions that potentially can be generated by the x87 FPU and by SSE/SSE2/SSE3 instructions. Table C-1.
FLOATING-POINT EXCEPTIONS SUMMARY C.2 X87 FPU INSTRUCTIONS Table C-2 lists the x87 FPU instructions in alphabetical order. For each instruction, it summarizes the floating-point exceptions that the instruction can generate. Table C-2.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-2. Exceptions Generated with x87 FPU Floating-Point Instructions (Contd.) Mnemonic FLD extended or stack Instruction Load floating-point #IS #IA #D #Z #O #U #P Y FLD single or double Load floating-point Y FLD1 Load + 1.0 Y Y Y FLDCW Load Control word Y Y Y Y Y Y Y FLDENV Load environment Y Y Y Y Y Y Y FLDL2E Load log2e Y FLDL2T Load log210 Y FLDLG2 Load log102 Y FLDLN2 Load loge2 Y FLDPI Load π Y FLDZ Load + 0.
Table C-2. Exceptions Generated with x87 FPU Floating-Point Instructions (Contd.) Mnemonic Instruction #IS #IA FUCOM(P)(P) Unordered compare floating-point FWAIT CPU Wait FXAM Examine FXCH Exchange registers Y FXTRACT Extract FYL2X FYL2XP1 C.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-3. Exceptions Generated with SSE Instructions (Contd.) Mnemonic Instruction #I #D #Z #O #U #P CVTPS2PI Convert lower two SP FP from XMM/Mem to two 32-bit signed integers in MM using rounding specified by MXCSR. Y CVTSI2SS Convert one 32-bit signed integer from Integer Reg/Mem to one SP FP. CVTSS2SI Convert one SP FP from XMM/Mem to one 32-bit signed integer using rounding mode specified by MXCSR, and move the result to an integer register.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-3. Exceptions Generated with SSE Instructions (Contd.) Mnemonic Instruction #I #D #Z #O #U #P MOVLPS Move two packed SP values between memory and the low half of an XMM register. MOVMSKPS Move sign mask to r32. MOVSS Move scalar SP number between an XMM register and memory or a second XMM register. MOVUPS Move unaligned packed data. MULPS Packed multiply. Y Y Y Y Y MULSS Scalar multiply. Y Y Y Y Y ORPS Packed OR.
FLOATING-POINT EXCEPTIONS SUMMARY C.4 SSE2 INSTRUCTIONS Table C-4 lists SSE2 instructions with at least one of the following characteristics: • • floating-point operands floating point results For each instruction, the table summarizes the floating-point exceptions that the instruction can generate. Table C-4. Exceptions Generated with SSE2 Instructions Instruction Description #I #D ADDPD Add two packed DP FP numbers from XMM2/Mem to XMM1.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction Description CVTTPS2DQ Convert four SP FP from XMM/Mem to four 32-bit signed integers in XMM using truncate. CVTDQ2PD Convert two 32-bit signed integers in XMM2/Mem to 2 DP FP in xmm1 using rounding specified by MXCSR. CVTPD2DQ #I #D #Z #O #U #P Y Y Convert two DP FP from XMM2/Mem to two 32-bit signed integers in xmm1 using rounding specified by MXCSR.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction Description #I #D #Z #O #U #P CVTTPD2DQ Convert two DP FP from XMM2/Mem to two 32-bit signed integers in XMM1 using truncate. Y Y CVTTPD2PI Convert two DP FP from XMM2/Mem to two 32-bit signed integers in MM1 using truncate. Y Y CVTTSD2SI Convert lowest DP FP from XMM/Mem to one 32 bit signed integer using truncate, and move the result to an integer register.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction MOVHPD Description #I #D #Z #O #U #P Move 64 bits representing one DP operand from Mem to upper field of XMM register. Or move 64 bits representing one DP operand from upper field of XMM register to Mem. MOVLPD Move 64 bits representing one DP operand from Mem to lower field of XMM register. Or move 64 bits representing one DP operand from lower field of XMM register to Mem.
Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction Description #I #D #Z #O #U #P SUBPD Subtract Packed Double-Precision. Y Y Y Y Y SUBSD Subtract Scalar Double-Precision. Y Y Y Y Y UCOMISD Compare lower DP FP number in XMM1 register with lower DP FP number in XMM2/Mem and set the status flags accordingly. Y Y UNPCKHPD Interleaves DP FP numbers from the high halves of XMM1 and XMM2/Mem into XMM1 register.
FLOATING-POINT EXCEPTIONS SUMMARY Table C-5. Exceptions Generated with SSE3 Instructions (Contd.) Instruction Description #O #U #P FISTTP See Table C-2. Y HADDPD Add horizontally packed DP FP numbers XMM2/Mem to XMM1.
APPENDIX D GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS As described in Chapter 8, “Programming with the x87 FPU,” the IA-32 Architecture supports two mechanisms for accessing exception handlers to handle unmasked x87 FPU exceptions: native mode and MS-DOS compatibility mode.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS D.1 MS-DOS COMPATIBILITY SUB-MODE FOR HANDLING X87 FPU EXCEPTIONS The first generations of IA-32 processors (starting with the Intel 8086 and 8088 processors and going through the Intel 286 and Intel386 processors) did not have an on-chip floating-point unit. Instead, floating-point capability was provided on a separate numeric coprocessor chip.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS as the Intel 286 and 287 processors. And again, to maintain compatibility with existing MS-DOS software, basically the same MS-DOS compatibility floating-point exception handling mechanism that was used in the IBM PC AT was used in PCs based on the Intel386 processor. D.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS Note that Intel, in order to provide Intel486 processors for market segments that had no need for an x87 FPU, created the “SX” versions. These Intel486 SX processors did not contain the floating-point unit. Intel also produced Intel 487 SX processors for end users who later decided to upgrade to a system with an x87 FPU. These Intel 487 SX processors are similar to standard Intel486 processors with a working x87 FPU on board.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS Some x87 FPU instructions with some x87 FPU exceptions use an “immediate” method of reporting errors. Here, the FERR# is asserted immediately, at the time that the exception occurs. The immediate method of error reporting is used for x87 FPU stack fault, invalid operation and denormal exceptions caused by all transcendental instructions, FSCALE, FXTRACT, FPREM and others, and all exceptions (except precision) when caused by x87 FPU store instructions.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS Figure D-1 is implemented. The temporal regions within the x87 FPU exception handler activity are described as follows: 1. The FERR# signal is activated by an x87 FPU exception and sends an interrupt request through the PIC to the processor’s INTR pin. 2. During the x87 FPU interrupt service routine (exception handler) the processor will need to clear the interrupt request latch (Flip Flop #1).
[Figure D-1: circuit diagram. Labels: RESET; I/O Port 0F0H Address Decode; two flip-flops (FF) with PR and CLR inputs; FERR#; IGNNE#; INTR; Interrupt Controller; FP_IRQ; processor block labeled Intel Processor, Pentium Processor, Pentium Pro Processor. Legend: FF #n = Flip Flop #n; CLR = Clear or Reset.] Figure D-1.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS inactive. So if the handler clears the x87 FPU exception condition before the 0F0H access, IGNNE# does not get activated and left on after exit from the handler. 0F0H Address Decode Figure D-2. Behavior of Signals During x87 FPU Exception Handling D.2.1.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS which is not an explicitly documented behavior of a no-wait instruction. This process is illustrated in Figure D-3. Exception Generating Floating-Point Instruction Assertion of FERR# by the Processor Start of the “No-Wait” Floating-Point Instruction System Dependent Delay Case 1 External Interrupt Sampling Window Assertion of INTR Pin by the System Case 2 Window Closed Figure D-3.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS There are two other ways, in addition to Case 1 above, in which a no-wait floatingpoint instruction can service a numeric exception inside its interrupt window. First, the first floating-point error condition could be of the “immediate” category (as defined in Section D.2.1.1, “Basic Rules: When FERR# Is Generated”) that asserts FERR# immediately.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS FERR# is asserted as soon as the x87 FPU detects an unmasked exception; there are no cases in which error reporting is deferred to the next x87 FPU or WAIT instruction. (As is discussed in Section D.2.1.1, “Basic Rules: When FERR# Is Generated,” most exception cases in the Intel486 and Pentium processors are of the deferred type.
D.3.1 Floating-Point Exceptions and Their Defaults The x87 FPU can recognize six classes of floating-point exception conditions while executing floating-point instructions:
1. #I — Invalid operation
   #IS — Stack fault
   #IA — IEEE standard invalid operation
2. #Z — Divide-by-zero
3. #D — Denormalized operand
4. #O — Numeric overflow
5. #U — Numeric underflow
6.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS is masked (the corresponding mask bit in the control word = 1), the processor takes an appropriate default action and continues with the computation. The processor has a default fix-up activity for every possible exception condition it may encounter. These masked-exception responses are designed to be safe and are generally acceptable for most numeric applications.
least programming effort. Certain exceptions can usefully be left unmasked during the debugging phase of software development, and then masked when the clean software is actually run. An invalid-operation exception, for example, typically indicates a program error that must be corrected. The exception flags in the x87 FPU status word provide a cumulative record of exceptions that have occurred since these flags were last cleared.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS Intel386 processor using the ERROR# status line between the processor and the coprocessor. See Section D.1, “MS-DOS Compatibility Sub-mode for Handling x87 FPU Exceptions,” in this appendix, and Chapter 17, “IA-32 Architecture Compatibility,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for differences in x87 FPU exception handling. The exception-handling routine is normally a part of the systems software.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS D.3.3.1 Exception Synchronization: What, Why and When Exception synchronization means that the exception handler inspects and deals with the exception in the context in which it occurred. If concurrent execution is allowed, the state of the processor when it recognizes the exception is often not in the context in which it occurred.
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS invoked, synchronization must always be considered to assure reliable performance. Example D-1 and Example D-2, below, illustrate the need to always consider exception synchronization when writing numeric code, even when the code is initially intended for execution with exceptions masked. D.3.3.
causes the processor to freeze immediately before executing such an instruction (unless the IGNNE# input is active, or it is a no-wait x87 FPU instruction). Exactly when the exception handler will be invoked (in the interval between when the exception is detected and the next WAIT or x87 FPU instruction) depends on the processor generation, the system, and which x87 FPU instruction and exception are involved.
epilogue must not load an unmasked exception flag into the x87 FPU or another exception will be requested immediately. The following code examples show the ASM386/486 coding of three skeleton exception handlers, with the save spaces given as correct for 32-bit protected mode.
Example D-3. Full-State Exception Handler

SAVE_ALL PROC
;
;SAVE REGISTERS, ALLOCATE STACK SPACE FOR x87 FPU STATE IMAGE
    PUSH EBP
    .
    .
    POPFD ;RESTORE IF TO VALUE BEFORE x87 FPU EXCEPTION
;
;APPLICATION-DEPENDENT EXCEPTION HANDLING CODE GOES HERE
;
;CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
;RESTORE MODIFIED ENVIRONMENT IMAGE
    MOV BYTE PTR [EBP-24], 0H
    FLDENV [EBP-28]
;DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
    MOV ESP, EBP
    .
    .
    POP EBP
;
;RETURN TO INTERRUPTED CALCULATION
    IRETD
SAVE_ENVIRONMENT ENDP

Example D-5. Reentrant Exception Handler
    .
    .
;APPLICATION-DEPENDENT EXCEPTION HANDLING CODE
;GOES HERE - AN UNMASKED EXCEPTION
;GENERATED HERE WILL CAUSE THE EXCEPTION HANDLER TO BE REENTERED
;IF LOCAL STORAGE IS NEEDED, IT MUST BE ALLOCATED ON THE STACK
    .
;CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
;RESTORE MODIFIED STATE IMAGE
    MOV BYTE PTR [EBP-104], 0H
    FRSTOR [EBP-108]
;DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
    MOV ESP, EBP
    .
    .
; type will cause the same problem ....
    FCLEX ; clear the x87 FPU error conditions & thus
          ; turn off FERR# & reset the IGNNE# FF

The problem will only occur if the processor enters SMM between the OUT and the FLDCW instructions. But if that happens, AND the SMM code saves the x87 FPU state using FNSAVE, then the IGNNE# Flip Flop will be cleared (because FNSAVE clears the x87 FPU errors and thus de-asserts FERR#).
D.3.6.1 Speculatively Deferring x87 FPU Saves, General Overview

To support multitasking, each thread in the system needs a save area for the general-purpose registers, and each task that is allowed to use floating-point needs an x87 FPU save area large enough to hold the entire x87 FPU stack and associated x87 FPU state such as the control word and status word. (See Section 8.1.
D.3.6.2 Tracking x87 FPU Ownership

Since the contents of the x87 FPU may not belong to the currently executing thread, the thread identifier for the last x87 FPU user needs to be tracked separately. This is not complicated; the kernel should simply provide a variable to store the thread identifier of the x87 FPU owner, separate from the variable that stores the identifier for the currently executing thread.
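The two-variable bookkeeping described above might be sketched as follows. Everything here is a simulation under stated assumptions: `kernel_state`, `switch_to`, and `dna_handler` are hypothetical names, thread identifiers are plain integers, and the real FNSAVE/FRSTOR operations are replaced by counters so the logic can be followed without kernel privileges.

```c
/* Hypothetical thread identifier type; NO_OWNER means the FPU is unowned. */
typedef int thread_id;
#define NO_OWNER (-1)

typedef struct {
    thread_id current_thread; /* thread that is executing right now        */
    thread_id fpu_owner;      /* thread whose state is in the x87 FPU      */
    int saves;                /* number of simulated FNSAVE operations     */
    int restores;             /* number of simulated FRSTOR operations     */
} kernel_state;

/* Task switch: only the current-thread variable changes. The x87 FPU
 * contents, and therefore fpu_owner, are deliberately left alone. */
void switch_to(kernel_state *k, thread_id next)
{
    k->current_thread = next;
}

/* Simulated device-not-available (DNA) handler, entered when the current
 * thread touches the x87 FPU after a task switch. */
void dna_handler(kernel_state *k)
{
    if (k->fpu_owner == k->current_thread)
        return;               /* state already belongs to this thread      */
    if (k->fpu_owner != NO_OWNER)
        k->saves++;           /* would FNSAVE into the old owner's area    */
    k->restores++;            /* would FRSTOR from the new owner's area    */
    k->fpu_owner = k->current_thread;
}
```

The point of keeping `fpu_owner` separate is visible in the early return: a thread that is switched out and back in without any other thread using the x87 FPU pays for no save or restore at all.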
FPU owner. A more general flow for a DNA exception handler that handles this case is shown in Figure D-5. Numeric exceptions received while the kernel owns the x87 FPU for a state swap must be discarded in the kernel without being dispatched to a handler. A flow for a numeric exception dispatch routine is shown in Figure D-6.
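The dispatch rule stated above (discard the exception when the kernel owns the x87 FPU, dispatch normally otherwise) can be written as a short C sketch. The names `fpu_owner_kind`, `dispatch_result`, `numeric_exception_dispatch`, and `sample_handler` are hypothetical and exist only for this illustration; a real kernel would determine ownership from its own thread bookkeeping.

```c
#include <stddef.h>

typedef enum { OWNER_KERNEL, OWNER_THREAD } fpu_owner_kind;
typedef enum { DISCARDED, DISPATCHED } dispatch_result;

/* Counter used by the sample handler so the effect can be observed. */
static int handled_count = 0;

/* Stand-in for an application's numeric exception handler. */
static void sample_handler(void)
{
    handled_count++;
}

/* Numeric exception dispatch routine: if the kernel owns the x87 FPU
 * (a state swap is in progress), exit without dispatching. */
dispatch_result numeric_exception_dispatch(fpu_owner_kind owner,
                                           void (*handler)(void))
{
    if (owner == OWNER_KERNEL)
        return DISCARDED;
    if (handler != NULL)
        handler();
    return DISPATCHED;
}
```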
[Figure D-6. Program Flow for a Numeric Exception Dispatch Routine. Flowchart: Numeric Exception Entry; decision "Is Kernel FPU Owner?"; Yes: Exit; No: Normal Dispatch to Numeric Exception Handler, then Exit.]

Case #1: x87 FPU State Swap Without Numeric Exception

Assume two threads A and B, both using the floating-point unit. Let A be the thread to have most recently executed a floating-point instruction, with no pending numeric exceptions. Let B be the currently executing thread. CR0.
DNA handler resumes execution, completing the FNSAVE of the old floating-point context of thread A and the FRSTOR of the floating-point context for thread B. Thread A eventually gets an opportunity to handle the exception that was discarded during the task switch. After some time, thread B is suspended, and thread A resumes execution. When thread A starts to execute a floating-point instruction, once again the DNA exception handler is entered.
IA-32 Architectures Software Developer’s Manual, Volume 2A, for information about exceptions generated if the memory region is not aligned).

3. Maintaining compatibility with legacy applications/libraries: The operating system changes to support Streaming SIMD Extensions must be invisible to legacy applications or libraries that deal only with floating-point instructions.
Using the dedicated INT 16 for x87 FPU exception handling is referred to as the native mode. It is the simplest approach, and the one recommended most highly by Intel.

D.4.2 Changes with Intel486, Pentium and Pentium Pro Processors with CR0.NE[bit 5] = 1

With these latest three generations of the IA-32 architecture, more enhancements and speedup features have been added to the corresponding x87 FPUs.
discussed above, FERR# gets asserted independent of the value of the NE bit, but when NE = 1, the operating system should not enable its path through the PIC.) Another possible (very rare) way a floating-point exception interrupt could occur while the kernel is executing is by an x87 FPU immediate exception case having its interrupt delayed by the external hardware until execution has switched to the kernel.
APPENDIX E
GUIDELINES FOR WRITING SIMD FLOATING-POINT EXCEPTION HANDLERS

See Section 11.5, “SSE, SSE2, and SSE3 Exceptions,” for a detailed discussion of SIMD floating-point exceptions. This appendix considers only SSE/SSE2/SSE3 instructions that can generate numeric (SIMD floating-point) exceptions, and gives an overview of the necessary support for handling such exceptions.
“Interrupt and Exception Handling,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). Some compilers use specific run-time libraries to assist in floating-point exception handling.
tion Implementation,” assume that the body of the handler (not shown here in detail) passes the saved state to a routine that will examine in turn all the sub-operands of the excepting instruction, invoking a user floating-point exception handler if a particular set of sub-operands raises an unmasked (enabled) exception, or emulating the instruction otherwise. Example 5-1.
occur immediately and are not delayed until a subsequent floating-point instruction is executed. However, floating-point emulation may be necessary when unmasked floating-point exceptions are generated. E.
operands into up to four sets of sub-operands, and will submit them one set at a time to an emulation function (see Example E-1 in Section E.4.3, “Example SIMD Floating-Point Emulation Implementation”). The emulation function will examine the sub-operands, and will possibly redo the necessary calculation.
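The splitting of a packed operation into sets of sub-operands can be sketched in C as below. This is an illustration only, assuming a packed single-precision operation with four elements: `packed_single`, `scalar_emulator`, `emulate_add`, and `emulate_packed` are hypothetical names, and the per-element emulation is reduced to a plain addition rather than the full exception-aware emulation of Example E-1.

```c
/* Four single-precision elements, as in one packed SSE operand. */
typedef struct {
    float f[4];
} packed_single;

/* Signature of a per-element emulation function (hypothetical). */
typedef float (*scalar_emulator)(float a, float b);

/* Trivial stand-in for an emulation function: a plain add. A real filter
 * would also check for exceptions on this pair of sub-operands. */
static float emulate_add(float a, float b)
{
    return a + b;
}

/* Submit the sub-operands one set at a time to the emulation function
 * and collect the four results. */
packed_single emulate_packed(packed_single a, packed_single b,
                             scalar_emulator emu)
{
    packed_single r;
    for (int i = 0; i < 4; i++)
        r.f[i] = emu(a.f[i], b.f[i]);
    return r;
}
```

A scalar instruction would use the same emulation function with a single set of sub-operands, which is why the text says "up to four".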
A diagram of the control flow in handling an unmasked floating-point exception is presented below.

[Figure E-1. Control Flow for Handling Unmasked Floating-Point Exceptions. Boxes: User Application; Low-Level Floating-Point Exception Handler; User Level Floating-Point Exception Filter; User Floating-Point Exception Handler.]

From the user-level floating-point filter, Example E-1 in Section E.4.
“Interrupt and Exception Handling,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

E.4.2.1 Numeric Exceptions

There are six classes of numeric (floating-point) exception conditions that can occur: Invalid operation (#I), Divide-by-Zero (#Z), Denormal Operand (#D), Numeric Overflow (#O), Numeric Underflow (#U), and Inexact Result (precision) (#P).
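As an illustration of how a filter might tell which of the six classes are both flagged and unmasked, the following C sketch works on a copy of the MXCSR value. It relies on the MXCSR layout (exception flags in bits 0 through 5, the corresponding mask bits in bits 7 through 12); the function names `unmasked_flags` and `exception_name` are hypothetical and not part of any Intel-provided interface.

```c
#include <stdint.h>

/* MXCSR bits 0-5: IE, DE, ZE, OE, UE, PE exception flags.
 * MXCSR bits 7-12: IM, DM, ZM, OM, UM, PM mask bits, in the same order. */
#define MXCSR_FLAG_BITS  0x3Fu
#define MXCSR_MASK_SHIFT 7

/* Returns the flagged exceptions whose mask bit is 0 (unmasked). */
uint32_t unmasked_flags(uint32_t mxcsr)
{
    uint32_t flags = mxcsr & MXCSR_FLAG_BITS;
    uint32_t masks = (mxcsr >> MXCSR_MASK_SHIFT) & MXCSR_FLAG_BITS;
    return flags & ~masks;
}

/* Maps a single flag bit to the notation used in this appendix. */
const char *exception_name(uint32_t flag_bit)
{
    switch (flag_bit) {
    case 0x01: return "#I"; /* invalid operation  */
    case 0x02: return "#D"; /* denormal operand   */
    case 0x04: return "#Z"; /* divide-by-zero     */
    case 0x08: return "#O"; /* numeric overflow   */
    case 0x10: return "#U"; /* numeric underflow  */
    case 0x20: return "#P"; /* inexact result     */
    default:   return "?";
    }
}
```

For example, with the power-up MXCSR value 1F80H all six exceptions are masked, so `unmasked_flags` returns 0 regardless of which flags are set.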
Table E-1.
Table E-2. CMPPS.EQ, CMPSS.EQ, CMPPS.ORD, CMPSS.ORD, CMPPD.EQ, CMPSD.EQ, CMPPD.ORD, CMPSD.ORD

Source Operands | Masked Result | Unmasked Result
NaN op Opd2 (any Opd2) | 00000000H or 0000000000000000H (see note 1) | 00000000H or 0000000000000000H (see note 1) (not an exception)
Opd1 op NaN (any Opd1) | 00000000H or 0000000000000000H (see note 1) | 00000000H or 0000000000000000H (see note 1) (not an exception)

NOTE:
1.
Table E-5. CMPPS.NLT, CMPSS.NLT, CMPPS.NLE, CMPSS.NLE, CMPPD.NLT, CMPSD.NLT, CMPPD.NLE, CMPSD.NLE

Source Operands | Masked Result | Unmasked Result
NaN op Opd2 (any Opd2) | FFFFFFFFH or FFFFFFFFFFFFFFFFH (see note 1) | None
Opd1 op NaN (any Opd1) | FFFFFFFFH or FFFFFFFFFFFFFFFFH (see note 1) | None

NOTE:
1. 32-bit results are for single, and 64-bit results for double precision operations.

Table E-6.
Table E-8. CVTPS2PI, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, CVTPD2PI, CVTSD2SI, CVTTPD2PI, CVTTSD2SI, CVTPS2DQ, CVTTPS2DQ, CVTPD2DQ, CVTTPD2DQ

Source Operand | Masked Result | Unmasked Result
SNaN | 80000000H or 8000000000000000H (see note 1) (Integer Indefinite) | None
QNaN | 80000000H or 8000000000000000H (see note 1) (Integer Indefinite) | None

NOTE:
1. 32-bit results are for single, and 64-bit results for double precision operations.

Table E-9.
Table E-11. CVTPS2PD, CVTSS2SD

Source Operands | Masked Result | Unmasked Result
QNaN | QNaN1 (see note 1) | QNaN1 (see note 1) (not an exception)
SNaN | QNaN1 (see note 2) | None

NOTES:
1. The double precision output QNaN1 is created from the single precision input QNaN as follows: the sign bit is preserved, the 8-bit exponent FFH is replaced by the 11-bit exponent 7FFH, and the 24-bit significand is extended to a 53-bit significand by appending 29 bits equal to 0.
2.
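The construction in note 1 can be checked with a few lines of bit manipulation. The sketch below is illustrative only and operates on raw bit patterns rather than on `float`/`double` values; `widen_qnan_bits` is a hypothetical helper name. Note that the 24-bit and 53-bit significands each include the implicit integer bit, so in the stored encodings a 23-bit fraction field becomes a 52-bit fraction field.

```c
#include <stdint.h>

/* Builds the double precision QNaN encoding from a single precision QNaN
 * encoding: sign preserved, exponent FFH -> 7FFH, fraction shifted left
 * by 29 so that 29 zero bits are appended at the low end. */
uint64_t widen_qnan_bits(uint32_t single_bits)
{
    uint64_t sign     = (uint64_t)(single_bits >> 31) << 63;
    uint64_t exponent = (uint64_t)0x7FFu << 52;
    uint64_t fraction = (uint64_t)(single_bits & 0x007FFFFFu) << 29;
    return sign | exponent | fraction;
}
```

For instance, the single precision QNaN 7FC00000H maps to the double precision QNaN 7FF8000000000000H.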
faults), no result is provided to the user handler. For post-computation exceptions (floating-point traps), a result is provided to the user handler, as specified below. In the following tables, the result is denoted by 'res', with the understanding that for the actual instruction, the destination coincides with the first source operand (except for COMISS, UCOMISS, COMISD, and UCOMISD, whose destination is the EFLAGS register). Table E-13.
Table E-13. #I - Invalid Operations (Contd.
Table E-15.
Table E-15. #D - Denormal Operand

Instruction | Condition | Masked Response | Unmasked Response and Exception Code
CVTSS2SD CVTPD2PS CVTSD2SS

NOTE:
1. For denormal encodings, see Section 4.8.3.2, “Normalized and Denormalized Finite Numbers.”

Table E-16.
Table E-16. #O - Numeric Overflow (Contd.)

Instruction: ADDPD ADDSUBPD HADDPD SUBPD HSUBPD MULPD DIVPD ADDSD SUBSD MULSD DIVSD
Condition: Rounded result > largest double precision finite normal value
Masked Response (rounding modes covered: To nearest, Toward –∞, Toward +∞, Toward 0):

Rounding | Sign | Result & Status Flags
To nearest | + | #OE = 1, #PE = 1, res = +∞
To nearest | - | #OE = 1, #PE = 1, res = –∞
Toward –∞ | + | #OE = 1, #PE = 1, res = 1.
Table E-17. #U - Numeric Underflow

Instruction | Condition | Masked Response
ADDPS ADDSUBPS HADDPS SUBPS HSUBPS MULPS DIVPS ADDSS SUBSS MULSS DIVSS CVTPD2PS CVTSD2SS | Result calculated with unbounded exponent and rounded to the destination precision < smallest single precision finite normal value. |
Table E-18.
E.4.3 Example SIMD Floating-Point Emulation Implementation

The sample code listed below may be considered as being part of a user-level floating-point exception filter for the SSE/SSE2/SSE3 numeric instructions. It is assumed that the filter function is invoked by a low-level exception handler (reached via interrupt vector 19 when an unmasked floating-point exception occurs), and that it operates as explained in Section E.4.
The arithmetic operations exemplified are emulated as follows:

1. If the denormals-are-zeros mode is enabled (the DAZ bit in MXCSR is set to 1), replace all the denormal inputs with zeroes of the same sign (the denormal flag is not affected by this change).
2. Perform the operation using x87 FPU instructions, with exceptions disabled, the original user rounding mode, and single precision.
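Step 1 above (replacing denormal inputs with zeros of the same sign) can be sketched on raw single precision bit patterns as follows. This is a minimal illustration, not the code of Example E-1: `daz_single` is a hypothetical helper, and it assumes the caller has already determined that the DAZ bit in MXCSR is set.

```c
#include <stdint.h>

/* If the encoding is a denormal (exponent field 0, fraction nonzero),
 * return a zero with the same sign; otherwise return it unchanged. */
uint32_t daz_single(uint32_t bits)
{
    uint32_t exponent = bits & 0x7F800000u;
    uint32_t fraction = bits & 0x007FFFFFu;

    if (exponent == 0 && fraction != 0)
        return bits & 0x80000000u; /* keep only the sign bit */
    return bits;
}
```

Zeros, normal values, infinities, and NaNs pass through untouched; only encodings with a zero exponent field and a nonzero fraction are replaced.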
Example E-1. SIMD Floating-Point Emulation

// masks for individual status word bits
#define PRECISION_MASK 0x20
#define UNDERFLOW_MASK 0x10
#define OVERFLOW_MASK 0x08
#define ZERODIVIDE_MASK 0x04
#define DENORMAL_MASK 0x02
#define INVALID_MASK 0x01

// 32-bit constants
static unsigned ZEROF_ARRAY[] = {0x00000000};
#define ZEROF *(float *) ZEROF_ARRAY
// +0.
int uiopd1; // first operand of the add, subtract, multiply, or divide
int uiopd2; // second operand of the add, subtract, multiply, or divide
float res; // result of the add, subtract, multiply, or divide
double dbl_res24; // result with 24-bit significand, but "unbounded" exponent
                  // (needed to check tininess, to provide a scaled result to
                  // an underflow/overflow trap handler, and in flush-to-zero mode)
double dbl_res; // result in double pre
switch (exc_env->rounding_mode) {
  case ROUND_TO_NEAREST:
    cw = 0x003f; // round to nearest, single precision, exceptions masked
    break;
  case ROUND_DOWN:
    cw = 0x043f; // round down, single precision, exceptions masked
    break;
  case ROUND_UP:
    cw = 0x083f; // round up, single precision, exceptions masked
    break;
  case ROUND_TO_ZERO:
    cw = 0x0c3f; // round to zero, single precision, exceptions masked
    break;
  default:
    ;
}
__asm {
  fldcw WORD PTR cw;
}
// comp
case MULPS:
case MULSS:
  // perform the multiplication
  __asm {
    fnclex;
    // load input operands
    fld DWORD PTR uiopd1; // may set denormal or invalid status flags
    fld DWORD PTR uiopd2; // may set denormal or invalid status flags
    fmulp st(1), st(0); // may set inexact or invalid status flags
    // store result
    fstp QWORD PTR dbl_res24; // exact
  }
  break;
case DIVPS:
case DIVSS:
  // perform the division
  __asm {
    fnclex;
    // load input operands
    fld DWORD PTR
// also fix for the differences in treating two NaN inputs between the
// SSE and SSE2 instructions and other IA-32 instructions
if (isnanf (uiopd1) || isnanf (uiopd2)) {
  if (isnanf (uiopd1) && isnanf (uiopd2))
    exc_env->result_fval = quietf (uiopd1);
  else
    exc_env->result_fval = (float)dbl_res24; // exact
  if (sw & INVALID_MASK) exc_env->status_flag_invalid_operation = 1;
  return (DO_NOT_RAISE_EXCEPTION);
}
// if denormal flag set, and denormal
// at this point, there are no enabled I,D, or Z exceptions to take; the instr.
    // load input operands
    fld DWORD PTR uiopd1; // may set the denormal status flag
    fld DWORD PTR uiopd2; // may set the denormal status flag
    faddp st(1), st(0); // rounded to 53 bits, may set the inexact
                        // status flag
    // store result
    fstp QWORD PTR dbl_res; // exact, will not set any flag
  }
  break;
case SUBPS:
case SUBSS:
  // perform the subtraction
  __asm {
    // load input operands
    fld DWORD PTR uiopd1; //
    fld DWORD PTR uiopd2; //
    fsubp st(1), st(0);
    ; // will never occur
}
// calculate result for the case an inexact trap has to be taken, or
// when no trap occurs (second IEEE rounding)
res = (float)dbl_res;
// may set P, U or O; may also involve denormalizing the result
// read status word
__asm {
  fstsw WORD PTR sw;
}
// if inexact traps are enabled and result is inexact, take inexact trap
if (!(exc_env->exc_masks & PRECISION_MASK) &&
    ((sw & PRECISION_MASK) || (exc_env->ftz && result_tiny))
// read status word to see if result is inexact
__asm {
  fstsw WORD PTR sw;
}
if (sw & UNDERFLOW_MASK) exc_env->status_flag_underflow = 1;
if (sw & OVERFLOW_MASK) exc_env->status_flag_overflow = 1;
if (sw & PRECISION_MASK) exc_env->status_flag_inexact = 1;
// if ftz = 1, and result is tiny (underflow traps must be disabled),
// result = 0.0
if (exc_env->ftz && result_tiny) {
  if (res > 0.0)
    res = ZEROF;
  else if (res < 0.
    case CVTSS2SI:
    case CVTTPS2PI:
    case CVTTSS2SI:
      ...
      break;
    case MAXPS:
    case MAXSS:
    case MINPS:
    case MINSS:
      ...
      break;
    case SQRTPS:
    case SQRTSS:
      ...
      break;
    ...
    case UNSPEC:
      ...
      break;
    default:
      ...
  }
}