
Count bits set in parallel a.k.a. Population Count 

Code 
01: unsigned int countBits(unsigned int x)
bits.stephanbrumme.com

Explanation 
countBits (line 1 − line 13): Counting all set bits of an integer was part of many mainframe CPU's assembler language but somehow x86 CPUs ignored it for decades. Apparently Intel introduced the POPCNT opcode in its Core i7 design. Meanwhile, the population count has to be implemented by other means. The main observations lies in the fact that you can subdivide any bitblock into smaller chunks, compute their population count and add all intermediate results. First, the code counts the bits of two adjacent bits: 0b and 0b → 00b 0b and 1b → 01b 1b and 0b → 01b 1b and 1b → 10b The whole algorithm modifies the input in order to generate the output, that means it works inplace. Line 4 performs the 2bit count at once based on the observation: 00b → unchanged, still 00b 01b → unchanged, still 01b 10b → must be converted to 01b 11b → must be converted to 10b Whenever the higher bit of each 2bit group is set, subtracting 01b gives the desired outcome. Looks like branching ... but as it turns out, the subtraction can be done always: just subtract the higher bit ! If it is 0, the result remains unchanged, if it is 1, then we get the right numbers, too. The shift x >> 1 and the following mask of all odd bits (0x55 is 01010101b): 00b → shifted: ?0b → masked: 00b → subtraction: 00b − 00b → 00b 01b → shifted: ?0b → masked: 00b → subtraction: 01b − 00b → 01b 10b → shifted: ?1b → masked: 01b → subtraction: 10b − 01b → 01b 11b → shifted: ?1b → masked: 01b → subtraction: 11b − 01b → 10b Now the 2bit count is done. As you can see, there are just three possible decimal results: 0, 1 or 2. Then, two adjacent 2bit groups are joined to 4bit groups (line 6): 00b and 00b → 0000b 00b and 01b → 0001b 00b and 10b → 0010b 01b and 00b → 0001b 01b and 01b → 0010b 01b and 10b → 0011b 10b and 00b → 0010b 10b and 01b → 0011b 10b and 10b → 0100b This time, the 2bit groups are masked and shifted to match and then simply added. No overflow is possible. 00b + 00b → 0000b 00b + 01b → 0001b 00b + 10b → 0010b 01b + 00b → 0001b 01b + 01b → 0010b 01b + 10b → 0011b 10b + 00b → 0010b 10b + 01b → 0011b 10b + 10b → 0100b The same procedure is done for all 4bit groups yielding the bit counts for each of the four bytes (line 8) in their lower four bits. That means, each byte contains its bit count, however, the upper four bits may contain junk and are masked out (line 10). Multiplying by 0x01010101 has an interesting property if we name the four bytes A, B, C, D: A, B, C, D → A+B+C+D, B+C+D, C+D, D Obviously the highest byte is what we are looking for. The right shift (line 12) returns just it. countBitsLoop (line 15 − line 25): A very elegant and sometimes faster algorithm was presented in Brian Kernighan's and Dennis Ritchie's famous book "The C Programming Language". On the internet I found out that Peter Wegener actually deserves the credit for first publishing it. Unlike the other algorithms, its performance heavily depends on its input. When less than about four bits are set, it outperforms all other algorithms but suffers heavily when almost all bits are set. The main trick, stripping a single bit with x &= x − 1 (line 21), deserves some attention: • if x = 0, then the whileloop is not entered at all, so we do not need to consider this case • if the rightmost bit is 1, then the rightmost bit of x − 1 is 0. All other bits are identical and x &= x − 1 → x = x − 1. Because all other bits are identical we stripped one set bit, the rightmost bit. • if the rightmost bits are 0, then x looks like this: ...1000. And x − 1 looks like this: ...0111. Result of x &= x1: ...0000. Hence, x &= x − 1 clears all bits except for the ones represented as dots, they remain the same. Again, exactly one bit was cleared. In general, x &= x − 1 always sets the rightmost bit which was 1 to 0. 
Restrictions 
• designed for 32 bits • except for countBitsLoop, which is fine for all integer bit widths 
These ads help me to pay my server bills 

Performance 
+ Intel® Pentium™ D: • countBits: ≈ 23 cycles per number, constant time, data independent • countBitsLoop: ≈ 68 cycles per number (average) • countBitsLoop(0): ≈ 4 cycles per number when 0 bits are set (0% are set) • countBitsLoop(16): ≈ 54 cycles per number when 16 bits are set (50% are set) • countBitsLoop(32): ≈ 145 cycles per number when 32 bits are set (100% are set) • simple (unrolled forloop): ≈ 60 cycles per number, constant time, data independent + Intel® Core™ 2: • countBits: ≈ 16.5 cycles per number, constant time, data independent • countBitsLoop: ≈ 37.5 cycles per number (average) • countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set) • countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set) • countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set) • simple (unrolled forloop): ≈ 77 cycles per number, constant time, data independent + Intel® Core™ i7: • countBits: ≈ 16 cycles per number, constant time, data independent • countBitsLoop: ≈ 38 cycles per number (average) • countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set) • countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set) • countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set) • simple (unrolled forloop): ≈ 67 cycles per number, constant time, data independent CPU cycles (full optimization, lower values are better) 
Assembler Output 
countBits:
bits.stephanbrumme.com

Download  The full source code including a test program is available for download. 
References 
AMD, "AMD Software Optimization Guide for AMD64 Processors" Peter Wegner: "A Technique for Counting Ones in a Binary Computer", Communications of the ACM, Volume 3 (1960) Number 5, page 322 
More Twiddled Bits 
... or go to the index 