Count bits set in parallel a.k.a. Population Count

Code
01: unsigned int countBits(unsigned int x)
02: {
03:   // count bits of each 2-bit chunk
04:   x  = x - ((x >> 1) & 0x55555555);
05:   // count bits of each 4-bit chunk
06:   x  = (x & 0x33333333) + ((x >> 2) & 0x33333333);
07:   // count bits of each 8-bit chunk
08:   x  = x + (x >> 4);
09:   // mask out junk
10:   x &= 0xF0F0F0F;
11:   // add all four 8-bit chunks
12:   return (x * 0x01010101) >> 24;
13: }
14:
15: unsigned int countBitsLoop(unsigned int x)
16: {
17:   unsigned int result = 0;
18:   // strip one set bit per iteration
19:   while (x != 0)
20:   {
21:     x &= x-1;
22:     result++;
23:   }
24:   return result;
25: }
bits.stephan-brumme.com

Explanation countBits (line 1line 13):
Counting all set bits of an integer was part of many mainframe CPU's assembler language but somehow
x86 CPUs ignored it for decades. Apparently Intel introduced the POPCNT opcode in its Core i7 design.

Meanwhile, the population count has to be implemented by other means.
The main observations lies in the fact that you can subdivide any bitblock into smaller chunks,
compute their population count and add all intermediate results.

First, the code counts the bits of two adjacent bits:

0b and 0b → 00b
0b and 1b → 01b
1b and 0b → 01b
1b and 1b → 10b

The whole algorithm modifies the input in order to generate the output, that means it works in-place.
Line 4 performs the 2-bit count at once based on the observation:

00b → unchanged, still 00b
01b → unchanged, still 01b
10b → must be converted to 01b
11b → must be converted to 10b

Whenever the higher bit of each 2-bit group is set, subtracting 01b gives the desired outcome.
Looks like branching ... but as it turns out, the subtraction can be done always: just subtract the higher bit !
If it is 0, the result remains unchanged, if it is 1, then we get the right numbers, too.
The shift x >> 1 and the following mask of all odd bits (0x55 is 01010101b):

00b → shifted: ?0b → masked: 00b → subtraction: 00b − 00b → 00b
01b → shifted: ?0b → masked: 00b → subtraction: 01b − 00b → 01b
10b → shifted: ?1b → masked: 01b → subtraction: 10b − 01b → 01b
11b → shifted: ?1b → masked: 01b → subtraction: 11b − 01b → 10b

Now the 2-bit count is done. As you can see, there are just three possible decimal results: 0, 1 or 2.

Then, two adjacent 2-bit groups are joined to 4-bit groups (line 6):

00b and 00b → 0000b
00b and 01b → 0001b
00b and 10b → 0010b
01b and 00b → 0001b
01b and 01b → 0010b
01b and 10b → 0011b
10b and 00b → 0010b
10b and 01b → 0011b
10b and 10b → 0100b

This time, the 2-bit groups are masked and shifted to match and then simply added. No overflow is possible.

00b + 00b → 0000b
00b + 01b → 0001b
00b + 10b → 0010b
01b + 00b → 0001b
01b + 01b → 0010b
01b + 10b → 0011b
10b + 00b → 0010b
10b + 01b → 0011b
10b + 10b → 0100b

The same procedure is done for all 4-bit groups yielding the bit counts for each of the four bytes (line 8)
in their lower four bits. That means, each byte contains its bit count, however, the upper four bits may
contain junk and are masked out (line 10).

Multiplying by 0x01010101 has an interesting property if we name the four bytes A, B, C, D:

A, B, C, D → A+B+C+D, B+C+D, C+D, D

Obviously the highest byte is what we are looking for. The right shift (line 12) returns just it.


countBitsLoop (line 15line 25):

A very elegant and sometimes faster algorithm was presented in Brian Kernighan's and Dennis Ritchie's famous book
"The C Programming Language". On the internet I found out that Peter Wegener actually deserves the credit for first publishing it.

Unlike the other algorithms, its performance heavily depends on its input. When less than about four bits are set, it outperforms all
other algorithms but suffers heavily when almost all bits are set.

The main trick, stripping a single bit with x &= x − 1 (line 21), deserves some attention:
• if x = 0, then the while-loop is not entered at all, so we do not need to consider this case
• if the rightmost bit is 1, then the rightmost bit of x − 1 is 0. All other bits are identical and x &= x − 1 → x = x − 1.
  Because all other bits are identical we stripped one set bit, the rightmost bit.
• if the rightmost bits are 0, then x looks like this: ...1000. And x − 1 looks like this: ...0111. Result of x &= x-1: ...0000.
  Hence, x &= x − 1 clears all bits except for the ones represented as dots, they remain the same. Again, exactly one bit was cleared.
In general, x &= x − 1 always sets the rightmost bit which was 1 to 0.

Restrictions • designed for 32 bits
• except for countBitsLoop, which is fine for all integer bit widths

These ads
help me to pay
my server bills

Performance + Intel® Pentium™ D:
countBits: ≈ 23 cycles per number, constant time, data independent
countBitsLoop: ≈ 68 cycles per number (average)
countBitsLoop(0): ≈ 4 cycles per number when 0 bits are set (0% are set)
countBitsLoop(16): ≈ 54 cycles per number when 16 bits are set (50% are set)
countBitsLoop(32): ≈ 145 cycles per number when 32 bits are set (100% are set)
• simple (unrolled for-loop): ≈ 60 cycles per number, constant time, data independent

+ Intel® Core™ 2:
countBits: ≈ 16.5 cycles per number, constant time, data independent
countBitsLoop: ≈ 37.5 cycles per number (average)
countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set)
countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set)
countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set)
• simple (unrolled for-loop): ≈ 77 cycles per number, constant time, data independent

+ Intel® Core™ i7:
countBits: ≈ 16 cycles per number, constant time, data independent
countBitsLoop: ≈ 38 cycles per number (average)
countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set)
countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set)
countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set)
• simple (unrolled for-loop): ≈ 67 cycles per number, constant time, data independent
Performance Chart
CPU cycles
(full optimization,
lower values are better)

Assembler Output
countBits:
compiler: VC 2008
in: eax ← x
out: eax → result
temp: ecx
01:   ; x = x - ((x >> 1) & 0x55555555)
02:   mov ecx, eax
03:   shr ecx, 1
04:   and ecx, 55555555h
05:   sub eax, ecx
06:   ; x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
07:   mov ecx, eax
08:   and eax, 33333333h
09:   shr ecx, 2
10:   and ecx, 33333333h
11:   add ecx, eax
12:   ; x = x + (x >> 4)
13:   mov eax, ecx
14:   shr eax, 4
15:   add eax, ecx
16:   ; x &= 0xF0F0F0F
17:   and eax, 0F0F0F0Fh
18:   ; (x * 0x01010101) >> 24
19:   imul eax, 01010101h
20:   shr eax, 24

countBitsLoop:
compiler: VC 2008
in: ecx ← x
out: eax → result
temp: edx
01:   ; result = 0
02:   xor eax, eax
03:   ; while (x != 0)
04:   test ecx, ecx
05:   je $finish
06: $while:
07:   ; x &= x-1
08:   lea edx, [ecx-1] ; clever x86 trick: edx = ecx-1
09:   and ecx, edx
10:   ; result++
11:   inc eax
12:   test ecx, ecx
13:   jne $while
14: $finish:
bits.stephan-brumme.com

Download The full source code including a test program is available for download.

References AMD, "AMD Software Optimization Guide for AMD64 Processors"
Peter Wegner: "A Technique for Counting Ones in a Binary Computer", Communications of the ACM, Volume 3 (1960) Number 5, page 322

More Twiddled Bits
  1. Absolute value of a float
  2. Absolute value of an integer
  3. Approximative inverse of a float
  4. Bit manipulation basics
  5. Bit mask of lowest bit not set
  6. Count bits set in parallel a.k.a. Population Count
  7. Detects zero bytes inside a 32 bit integer
  8. Endianess
  9. Extend bit width
  10. Float inverse square root approximation
  11. Float square root approximation
  12. Is power of two
  13. Minimum / maximum of integers
  14. Parity
  15. Position of lowest bit set
  16. Round up to the next power of two
  17. Sign comparison
  18. Sign of a 32 bit integer
  19. Swap two values
Extra: Javascript bit manipulator

... or go to the index