User's Manual

Unrolling Loops 69
22007E/0 November 1999 AMD Athlon Processor x86 Code Optimization
Without Loop Unrolling:
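MOV ECX, MAX_LENGTH      ; ECX = element count (assumed nonzero)
MOV EAX, OFFSET A        ; EAX -> A
MOV EBX, OFFSET B        ; EBX -> B
$add_loop:
FLD QWORD PTR [EAX]      ; load A[i]
FADD QWORD PTR [EBX]     ; add B[i]
FSTP QWORD PTR [EAX]     ; store the sum to A[i] and pop the x87 stack
ADD EAX, 8               ; advance to the next double in A
ADD EBX, 8               ; advance to the next double in B
DEC ECX                  ; one fewer element remaining
JNZ $add_loop            ; repeat until ECX reaches zero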
The loop consists of seven instructions. The AMD Athlon
processor can decode and retire three instructions per cycle, so
the loop cannot execute faster than three iterations in seven
cycles, or 3/7 floating-point adds per cycle. The pipelined
floating-point adder, however, allows one add every cycle, so
the adder sits idle more than half the time. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential end cases that must be handled outside the
loop.
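
Before turning to the unrolled version, the decode-limited bound quoted above can be reproduced with a small calculation. The following C sketch is only an illustration: the function name and its parameters are hypothetical, and it models nothing but decode/retire bandwidth (no latencies, no memory effects).

```c
#include <assert.h>

/* Hypothetical helper: upper bound on floating-point adds per cycle when
   the only limit is decode/retire bandwidth.
   adds_per_iter   - FADDs in one loop iteration
   instrs_per_iter - total instructions in one loop iteration
   decode_width    - instructions decoded/retired per cycle (3 in the text) */
static double decode_bound_adds_per_cycle(int adds_per_iter,
                                          int instrs_per_iter,
                                          int decode_width)
{
    assert(instrs_per_iter > 0);
    /* decode_width / instrs_per_iter iterations complete per cycle,
       and each iteration performs adds_per_iter adds. */
    return (double)adds_per_iter * decode_width / instrs_per_iter;
}
```

For the seven-instruction loop, `decode_bound_adds_per_cycle(1, 7, 3)` evaluates to 3/7, or roughly 0.43 adds per cycle, against the adder's peak of one add per cycle.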
With Partial Loop Unrolling:
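MOV ECX, MAX_LENGTH      ; ECX = element count (assumed >= 2)
MOV EAX, OFFSET A        ; EAX -> A
MOV EBX, OFFSET B        ; EBX -> B
SHR ECX, 1               ; ECX = count / 2; CF = 1 if count was odd
JNC $add_loop            ; even count: no end case to handle
FLD QWORD PTR [EAX]      ; odd count: process one element up front
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD EAX, 8
ADD EBX, 8
$add_loop:
FLD QWORD PTR [EAX]      ; first element of the pair
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
FLD QWORD PTR [EAX+8]    ; second element of the pair
FADD QWORD PTR [EBX+8]
FSTP QWORD PTR [EAX+8]
ADD EAX, 16              ; advance both pointers by two doubles
ADD EBX, 16
DEC ECX                  ; one fewer pair remaining
JNZ $add_loop            ; repeat until all pairs are done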
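
For readers more comfortable with C, the same transformation can be sketched as follows. This is an illustrative counterpart to the listing, not code from the manual: the function name `add_arrays_unrolled2` and its parameters are hypothetical, and a compiler may or may not generate the instruction sequence shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of the listing: a[i] += b[i] for n doubles,
   with the loop body unrolled by a factor of two. */
static void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL));

    /* End case, handled outside the loop: if n is odd, process one
       element first (the role of SHR ECX, 1 / JNC in the listing). */
    if (n & 1) {
        a[0] += b[0];
        i = 1;
    }

    /* Main loop: the remaining count is even, so each pass can safely
       process two elements. The condition is tested before the first
       pass, so n = 0 and n = 1 are also handled, unlike the listing. */
    for (; i < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```

Handling the odd element before the loop, rather than after it, keeps the loop body free of any per-iteration check; the pointer updates and the loop-control instructions are shared by two adds instead of one.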
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes