# Count bits set in parallel a.k.a. Population Count

Code
``` 01: unsigned int countBits(unsigned int x) 02: { 03:   // count bits of each 2-bit chunk 04:   x  = x - ((x >> 1) & 0x55555555); 05:   // count bits of each 4-bit chunk 06:   x  = (x & 0x33333333) + ((x >> 2) & 0x33333333); 07:   // count bits of each 8-bit chunk 08:   x  = x + (x >> 4); 09:   // mask out junk 10:   x &= 0xF0F0F0F; 11:   // add all four 8-bit chunks 12:   return (x * 0x01010101) >> 24; 13: } 14: 15: unsigned int countBitsLoop(unsigned int x) 16: { 17:   unsigned int result = 0; 18:   // strip one set bit per iteration 19:   while (x != 0) 20:   { 21:     x &= x-1; 22:     result++; 23:   } 24:   return result; 25: } ``` bits.stephan-brumme.com

Explanation countBits (line 1line 13):
Counting all set bits of an integer was part of many mainframe CPU's assembler language but somehow
x86 CPUs ignored it for decades. Apparently Intel introduced the POPCNT opcode in its Core i7 design.

Meanwhile, the population count has to be implemented by other means.
The main observations lies in the fact that you can subdivide any bitblock into smaller chunks,
compute their population count and add all intermediate results.

First, the code counts the bits of two adjacent bits:

0b and 0b → 00b
0b and 1b → 01b
1b and 0b → 01b
1b and 1b → 10b

The whole algorithm modifies the input in order to generate the output, that means it works in-place.
Line 4 performs the 2-bit count at once based on the observation:

00b → unchanged, still 00b
01b → unchanged, still 01b
10b → must be converted to 01b
11b → must be converted to 10b

Whenever the higher bit of each 2-bit group is set, subtracting 01b gives the desired outcome.
Looks like branching ... but as it turns out, the subtraction can be done always: just subtract the higher bit !
If it is 0, the result remains unchanged, if it is 1, then we get the right numbers, too.
The shift x >> 1 and the following mask of all odd bits (0x55 is 01010101b):

00b → shifted: ?0b → masked: 00b → subtraction: 00b − 00b → 00b
01b → shifted: ?0b → masked: 00b → subtraction: 01b − 00b → 01b
10b → shifted: ?1b → masked: 01b → subtraction: 10b − 01b → 01b
11b → shifted: ?1b → masked: 01b → subtraction: 11b − 01b → 10b

Now the 2-bit count is done. As you can see, there are just three possible decimal results: 0, 1 or 2.

Then, two adjacent 2-bit groups are joined to 4-bit groups (line 6):

00b and 00b → 0000b
00b and 01b → 0001b
00b and 10b → 0010b
01b and 00b → 0001b
01b and 01b → 0010b
01b and 10b → 0011b
10b and 00b → 0010b
10b and 01b → 0011b
10b and 10b → 0100b

This time, the 2-bit groups are masked and shifted to match and then simply added. No overflow is possible.

00b + 00b → 0000b
00b + 01b → 0001b
00b + 10b → 0010b
01b + 00b → 0001b
01b + 01b → 0010b
01b + 10b → 0011b
10b + 00b → 0010b
10b + 01b → 0011b
10b + 10b → 0100b

The same procedure is done for all 4-bit groups yielding the bit counts for each of the four bytes (line 8)
in their lower four bits. That means, each byte contains its bit count, however, the upper four bits may
contain junk and are masked out (line 10).

Multiplying by 0x01010101 has an interesting property if we name the four bytes A, B, C, D:

A, B, C, D → A+B+C+D, B+C+D, C+D, D

Obviously the highest byte is what we are looking for. The right shift (line 12) returns just it.

countBitsLoop (line 15line 25):

A very elegant and sometimes faster algorithm was presented in Brian Kernighan's and Dennis Ritchie's famous book
"The C Programming Language". On the internet I found out that Peter Wegener actually deserves the credit for first publishing it.

Unlike the other algorithms, its performance heavily depends on its input. When less than about four bits are set, it outperforms all
other algorithms but suffers heavily when almost all bits are set.

The main trick, stripping a single bit with x &= x − 1 (line 21), deserves some attention:
• if x = 0, then the while-loop is not entered at all, so we do not need to consider this case
• if the rightmost bit is 1, then the rightmost bit of x − 1 is 0. All other bits are identical and x &= x − 1 → x = x − 1.
Because all other bits are identical we stripped one set bit, the rightmost bit.
• if the rightmost bits are 0, then x looks like this: ...1000. And x − 1 looks like this: ...0111. Result of x &= x-1: ...0000.
Hence, x &= x − 1 clears all bits except for the ones represented as dots, they remain the same. Again, exactly one bit was cleared.
In general, x &= x − 1 always sets the rightmost bit which was 1 to 0.

Restrictions • designed for 32 bits
• except for countBitsLoop, which is fine for all integer bit widths

help me to pay
my server bills

Performance + Intel® Pentium™ D:
countBits: ≈ 23 cycles per number, constant time, data independent
countBitsLoop: ≈ 68 cycles per number (average)
countBitsLoop(0): ≈ 4 cycles per number when 0 bits are set (0% are set)
countBitsLoop(16): ≈ 54 cycles per number when 16 bits are set (50% are set)
countBitsLoop(32): ≈ 145 cycles per number when 32 bits are set (100% are set)
• simple (unrolled for-loop): ≈ 60 cycles per number, constant time, data independent

+ Intel® Core™ 2:
countBits: ≈ 16.5 cycles per number, constant time, data independent
countBitsLoop: ≈ 37.5 cycles per number (average)
countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set)
countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set)
countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set)
• simple (unrolled for-loop): ≈ 77 cycles per number, constant time, data independent

+ Intel® Core™ i7:
countBits: ≈ 16 cycles per number, constant time, data independent
countBitsLoop: ≈ 38 cycles per number (average)
countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set)
countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set)
countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set)
• simple (unrolled for-loop): ≈ 67 cycles per number, constant time, data independent CPU cycles
(full optimization,
lower values are better)

Assembler Output
``` countBits: 01:   ; x = x - ((x >> 1) & 0x55555555) 02:   mov ecx, eax 03:   shr ecx, 1 04:   and ecx, 55555555h 05:   sub eax, ecx 06:   ; x = (x & 0x33333333) + ((x >> 2) & 0x33333333) 07:   mov ecx, eax 08:   and eax, 33333333h 09:   shr ecx, 2 10:   and ecx, 33333333h 11:   add ecx, eax 12:   ; x = x + (x >> 4) 13:   mov eax, ecx 14:   shr eax, 4 15:   add eax, ecx 16:   ; x &= 0xF0F0F0F 17:   and eax, 0F0F0F0Fh 18:   ; (x * 0x01010101) >> 24 19:   imul eax, 01010101h 20:   shr eax, 24 countBitsLoop: 01:   ; result = 0 02:   xor eax, eax 03:   ; while (x != 0) 04:   test ecx, ecx 05:   je \$finish 06:   \$while: 07:   ; x &= x-1 08:   lea edx, [ecx-1] ; clever x86 trick: edx = ecx-1 09:   and ecx, edx 10:   ; result++ 11:   inc eax 12:   test ecx, ecx 13:   jne \$while 14:   \$finish: ``` bits.stephan-brumme.com