User's Manual

AMD Athlon Processor x86 Code Optimization
22007E/0 November 1999
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor's large, 64-Kbyte
instruction cache and completely unroll small loops. Unrolling
loops can be beneficial to performance, especially if the loop
body is small, which makes the loop overhead significant. Many
compilers are not aggressive at unrolling loops. For loops that
have a small fixed loop count and a small loop body, completely
unrolling the loops at the source level is recommended.
Example 1 (Avoid):
// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
    r[i] = 0;
    for (j=0; j<4; j++) {
        r[i] += M[j][i]*V[j];
    }
}
Example 2 (Preferred):
// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] +
       M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] +
       M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] +
       M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] +
       M[3][3]*V[3];
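For readers who want to compile and compare the two forms, the sketch below wraps Example 1 and Example 2 in functions. The function names transform_loop and transform_unrolled, the float element type, and the parameter layout are assumptions made for illustration; the examples above do not specify them.

```c
/* Illustrative sketch only: function names and the float type are assumptions. */

/* Rolled form (Example 1): loop control runs on every one of the 16 steps. */
void transform_loop(float r[4], float M[4][4], const float V[4])
{
    int i, j;
    for (i = 0; i < 4; i++) {
        r[i] = 0;
        for (j = 0; j < 4; j++) {
            r[i] += M[j][i] * V[j];
        }
    }
}

/* Completely unrolled form (Example 2): the same products, no loop control. */
void transform_unrolled(float r[4], float M[4][4], const float V[4])
{
    r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
    r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
    r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
    r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
}
```

Calling both functions on the same M and V and comparing the results is a simple way to check a hand-unrolled loop before measuring whether it is actually faster in a given build.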
Avoid Unnecessary Store-to-Load Dependencies
A store-to-load dependency exists when data is stored to
memory, only to be read back shortly thereafter. See
"Store-to-Load Forwarding Restrictions" on page 51 for more
details. The AMD Athlon processor contains hardware to
accelerate such store-to-load dependencies, allowing the load to
obtain the store data before it has been written to memory.
However, it is still faster to avoid such dependencies altogether
and keep the data in an internal register.
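As a minimal sketch of the idea, consider accumulating a sum through a pointer. The function names sum_through_memory and sum_in_register are hypothetical, and the sketch assumes that total does not point into the array a. When the compiler cannot rule out such aliasing, the first form may store and reload *total on each iteration, while the second lets the running value stay in a register.

```c
#include <stddef.h>

/* Illustrative sketch only: names are hypothetical; assumes total does not
   alias any element of a[]. */

/* Avoid: each iteration may load *total, add to it, and store it back. */
void sum_through_memory(double *total, const double *a, size_t n)
{
    size_t k;
    *total = 0.0;
    for (k = 0; k < n; k++) {
        *total += a[k];
    }
}

/* Preferred: accumulate in a local variable and store the result once. */
void sum_in_register(double *total, const double *a, size_t n)
{
    size_t k;
    double t = 0.0;
    for (k = 0; k < n; k++) {
        t += a[k];
    }
    *total = t;
}
```

Both functions produce the same sum under the stated no-aliasing assumption; the second simply removes the memory round trip from the loop body.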
Avoiding store-to-load dependencies is especially important if
they are part of a long dependency chain, as might occur in a
recurrence computation. If the dependency occurs while
operating on arrays, many compilers are unable to optimize the