User's Manual

AMD Athlon™ Processor x86 Code Optimization

22007E/0—November 1999

Completely Unroll Small Loops

Take advantage of the AMD Athlon processor’s large, 64-Kbyte
instruction cache and completely unroll small loops. Unrolling
loops can be beneficial to performance, especially if the loop
body is small, which makes the loop overhead significant. Many
compilers are not aggressive at unrolling loops. For loops that
have a small fixed loop count and a small loop body, completely
unrolling the loops at the source level is recommended.

Example 1 (Avoid):

// 3D-transform: multiply vector V by 4x4 transform matrix M

for (i=0; i<4; i++) {
  r[i] = 0;
  for (j=0; j<4; j++) {
    r[i] += M[j][i]*V[j];
  }
}

Example 2 (Preferred):

// 3D-transform: multiply vector V by 4x4 transform matrix M

r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
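
The two forms can be checked against each other with a small, self-contained sketch such as the one below. The function names (transform_loop, transform_unrolled, check_unrolled_matches_loop), the float element type, and the sample values are assumptions made for illustration only; they are not part of the examples above. The sample values are small integers so that every product and sum is exact in floating point.

```c
#include <assert.h>

/* Loop form, as in Example 1: r[i] = sum over j of M[j][i] * V[j]. */
void transform_loop(float r[4], const float M[4][4], const float V[4])
{
    int i, j;
    for (i = 0; i < 4; i++) {
        r[i] = 0.0f;
        for (j = 0; j < 4; j++) {
            r[i] += M[j][i] * V[j];
        }
    }
}

/* Completely unrolled form, as in Example 2: no loop counters,
   no compare-and-branch overhead. */
void transform_unrolled(float r[4], const float M[4][4], const float V[4])
{
    r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
    r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
    r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
    r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
}

/* Hypothetical self-check: both forms must agree on sample data. */
void check_unrolled_matches_loop(void)
{
    /* Illustrative values only: M[j][i] = 4*j + i + 1. */
    const float M[4][4] = {
        { 1.0f,  2.0f,  3.0f,  4.0f},
        { 5.0f,  6.0f,  7.0f,  8.0f},
        { 9.0f, 10.0f, 11.0f, 12.0f},
        {13.0f, 14.0f, 15.0f, 16.0f}
    };
    const float V[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float a[4], b[4];
    int i;

    transform_loop(a, M, V);
    transform_unrolled(b, M, V);
    for (i = 0; i < 4; i++) {
        assert(a[i] == b[i]);
    }
}
```

Calling check_unrolled_matches_loop() from a test driver is one way to confirm that a hand-unrolled routine still matches the loop it replaced; transposed indices and mistyped identifiers are easy mistakes to make when unrolling by hand.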

Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to
memory, only to be read back shortly thereafter. See
“Store-to-Load Forwarding Restrictions” on page 51 for more
details. The AMD Athlon processor contains hardware to
accelerate such store-to-load dependencies, allowing the load to
obtain the store data before it has been written to memory.
However, it is still faster to avoid such dependencies altogether
and keep the data in an internal register.
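
As a rough illustration of keeping data in a register, the sketch below contrasts a running total kept in memory with one kept in a local variable. The function names (sum_through_memory, sum_in_register, check_sums_agree) and the sample data are hypothetical and chosen only for this illustration. Whether a given compiler removes the reload in the first version on its own depends on the compiler and on what it can prove about aliasing, so the first version should be read as a source-level pattern to avoid, not as a statement about generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Avoid: as written, each iteration stores the running total to
   *total, and the next iteration reads *total back before adding
   to it. At the source level this is a store-to-load dependency
   on every pass through the loop. */
void sum_through_memory(const int *a, size_t n, int *total)
{
    size_t i;
    *total = 0;
    for (i = 0; i < n; i++) {
        *total += a[i];
    }
}

/* Prefer: accumulate in a local variable, which the compiler can
   keep in a register, and store to memory once after the loop. */
void sum_in_register(const int *a, size_t n, int *total)
{
    size_t i;
    int t = 0;
    for (i = 0; i < n; i++) {
        t += a[i];
    }
    *total = t;
}

/* Hypothetical self-check: both versions must produce the same sum. */
void check_sums_agree(void)
{
    const int a[5] = {1, 2, 3, 4, 5};  /* illustrative data only */
    int s1 = -1, s2 = -1;

    sum_through_memory(a, 5, &s1);
    sum_in_register(a, 5, &s2);
    assert(s1 == s2);
}
```

Both functions return the same result; the difference is only in where the intermediate total lives while the loop runs.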

Avoiding store-to-load dependencies is especially important if
they are part of a long dependency chain, as might occur in a
recurrence computation. If the dependency occurs while
operating on arrays, many compilers are unable to optimize the