User's Manual

92 Efficient Implementation of Population Count Function
AMD Athlon Processor x86 Code Optimization
22007E/0November 1999
Step 3 For the first time, the value in each k-bit field is small enough
that adding two k-bit fields results in a value that still fits in the
k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired
sum and whose upper half contains "junk" that has to be
masked out. In a symbolic form:
x = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The WWWW, XXXX, YYYY, and ZZZZ values are the
interesting sums with each at most 1000b, or 8 decimal.
Step 4 The four 4-bit sums can now be rapidly accumulated by means
of a multiply with a "magic" multiplier. This can be derived
from looking at the following chart of partial products:
0p0q0r0s * 01010101 =
:0p0q0r0s
0p:0q0r0s
0p0q:0r0s
0p0q0r:0s
000pxxww:vvuutt0s
Here p, q, r, and s are the 4-bit sums from the previous step, and
vv is the final result in which we are interested. Thus, the final
result:
z = (y * 0x01010101) >> 24
Example:
unsigned int popcount(unsigned int v)
{
unsigned int retVal;
__asm {
MOV EAX, [v] ;v
MOV EDX, EAX ;v
SHR EAX, 1 ;v >> 1
AND EAX, 055555555h ;(v >> 1) & 0x55555555
SUB EDX, EAX ;w = v - ((v >> 1) & 0x55555555)
MOV EAX, EDX ;w
SHR EDX, 2 ;w >> 2
AND EAX, 033333333h ;w & 0x33333333
AND EDX, 033333333h ;(w >> 2) & 0x33333333