User's Manual

22007E/0—November 1999 AMD Athlon™ Processor x86 Code Optimization

Without Loop Unrolling:

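        MOV   ECX, MAX_LENGTH        ; ECX = number of elements to add
        MOV   EAX, OFFSET A          ; EAX points to array A
        MOV   EBX, OFFSET B          ; EBX points to array B
$add_loop:
        FLD   QWORD PTR [EAX]        ; load A[i]
        FADD  QWORD PTR [EBX]        ; ST(0) = A[i] + B[i]
        FSTP  QWORD PTR [EAX]        ; store the sum to A[i] and pop
        ADD   EAX, 8                 ; advance both pointers by one QWORD
        ADD   EBX, 8
        DEC   ECX                    ; one fewer element remaining
        JNZ   $add_loop              ; repeat until ECX reaches zero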

The loop consists of seven instructions. The AMD Athlon processor can decode/retire three instructions per cycle, so the loop cannot execute faster than three iterations in seven cycles, or 3/7 floating-point adds per cycle. However, the pipelined floating-point adder allows one add every cycle, so the loop is limited by decode/retire bandwidth rather than by the adder. In the following code, the loop is partially unrolled by a factor of two, which creates potential endcases that must be handled outside the loop:

With Partial Loop Unrolling:

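        MOV   ECX, MAX_LENGTH        ; ECX = number of elements to add
                                     ; (as written, assumes MAX_LENGTH >= 2)
        MOV   EAX, OFFSET A          ; EAX points to array A
        MOV   EBX, OFFSET B          ; EBX points to array B
        SHR   ECX, 1                 ; ECX = pair count; CF = 1 if count was odd
        JNC   $add_loop              ; even count: no endcase to handle
        FLD   QWORD PTR [EAX]        ; odd count: add one element up front
        FADD  QWORD PTR [EBX]
        FSTP  QWORD PTR [EAX]
        ADD   EAX, 8                 ; step past the element just handled
        ADD   EBX, 8
$add_loop:
        FLD   QWORD PTR [EAX]        ; first element of the pair
        FADD  QWORD PTR [EBX]
        FSTP  QWORD PTR [EAX]
        FLD   QWORD PTR [EAX+8]      ; second element of the pair
        FADD  QWORD PTR [EBX+8]
        FSTP  QWORD PTR [EAX+8]
        ADD   EAX, 16                ; advance both pointers by two QWORDs
        ADD   EBX, 16
        DEC   ECX                    ; one fewer pair remaining
        JNZ   $add_loop              ; repeat until ECX reaches zero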
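
For readers who want to see the same transformation at a higher level, the following is a minimal C sketch of the two listings above, not code from this manual. The function names add_arrays, add_arrays_unrolled2, and decode_bound_adds_per_cycle are hypothetical, chosen only for illustration. The last helper simply restates the decode/retire arithmetic from the text and assumes every instruction in the loop counts equally against the decode width.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward version: one floating-point add per loop iteration. */
void add_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Partially unrolled by a factor of two. An odd element count leaves
   one leftover element (the endcase), which is handled before the main
   loop, mirroring the SHR ECX, 1 / JNC sequence in the assembly. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    if (n & 1) {                 /* odd count: peel off one element */
        a[0] += b[0];
        i = 1;
    }
    for (; i < n; i += 2) {      /* remaining count is even */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}

/* Upper bound on adds per cycle imposed by decode/retire bandwidth
   alone (hypothetical helper; ignores every other pipeline limit). */
double decode_bound_adds_per_cycle(unsigned instructions_per_iteration,
                                   unsigned adds_per_iteration,
                                   unsigned decode_width)
{
    assert(instructions_per_iteration > 0);
    return (double)adds_per_iteration * decode_width
           / instructions_per_iteration;
}
```

Calling decode_bound_adds_per_cycle(7, 1, 3) gives 3/7 for the original loop, the figure quoted earlier. Unlike the assembly listing, the C loop tests its condition before the first pass, so it also tolerates counts of zero and one. A compiler may unroll or vectorize either C function on its own, so the instruction counts discussed in the text apply to the assembly listings and not to this sketch.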

Now the loop consists of 10 instructions. Based on the

decode/retire bandwidth of three OPs per cycle, this loop goes