Count bits set in parallel a.k.a. Population Count

Count bits set in parallel a.k.a. Population Count
Code	01: unsigned int countBits(unsigned int x) 02: { 03: // count bits of each 2-bit chunk 04: x = x - ((x >> 1) & 0x55555555); 05: // count bits of each 4-bit chunk 06: x = (x & 0x33333333) + ((x >> 2) & 0x33333333); 07: // count bits of each 8-bit chunk 08: x = x + (x >> 4); 09: // mask out junk 10: x &= 0xF0F0F0F; 11: // add all four 8-bit chunks 12: return (x * 0x01010101) >> 24; 13: } 14: 15: unsigned int countBitsLoop(unsigned int x) 16: { 17: unsigned int result = 0; 18: // strip one set bit per iteration 19: while (x != 0) 20: { 21: x &= x-1; 22: result++; 23: } 24: return result; 25: } bits.stephan-brumme.com

Explanation	countBits (line 1 − line 13): Counting all set bits of an integer was part of many mainframe CPU's assembler language but somehow x86 CPUs ignored it for decades. Apparently Intel introduced the POPCNT opcode in its Core i7 design. Meanwhile, the population count has to be implemented by other means. The main observations lies in the fact that you can subdivide any bitblock into smaller chunks, compute their population count and add all intermediate results. First, the code counts the bits of two adjacent bits: 0b and 0b → 00b 0b and 1b → 01b 1b and 0b → 01b 1b and 1b → 10b The whole algorithm modifies the input in order to generate the output, that means it works in-place. Line 4 performs the 2-bit count at once based on the observation: 00b → unchanged, still 00b 01b → unchanged, still 01b 10b → must be converted to 01b 11b → must be converted to 10b Whenever the higher bit of each 2-bit group is set, subtracting 01b gives the desired outcome. Looks like branching ... but as it turns out, the subtraction can be done always: just subtract the higher bit ! If it is 0, the result remains unchanged, if it is 1, then we get the right numbers, too. The shift x >> 1 and the following mask of all odd bits (0x55 is 01010101b): 00b → shifted: ?0b → masked: 00b → subtraction: 00b − 00b → 00b 01b → shifted: ?0b → masked: 00b → subtraction: 01b − 00b → 01b 10b → shifted: ?1b → masked: 01b → subtraction: 10b − 01b → 01b 11b → shifted: ?1b → masked: 01b → subtraction: 11b − 01b → 10b Now the 2-bit count is done. As you can see, there are just three possible decimal results: 0, 1 or 2. Then, two adjacent 2-bit groups are joined to 4-bit groups (line 6): 00b and 00b → 0000b 00b and 01b → 0001b 00b and 10b → 0010b 01b and 00b → 0001b 01b and 01b → 0010b 01b and 10b → 0011b 10b and 00b → 0010b 10b and 01b → 0011b 10b and 10b → 0100b This time, the 2-bit groups are masked and shifted to match and then simply added. No overflow is possible. 00b + 00b → 0000b 00b + 01b → 0001b 00b + 10b → 0010b 01b + 00b → 0001b 01b + 01b → 0010b 01b + 10b → 0011b 10b + 00b → 0010b 10b + 01b → 0011b 10b + 10b → 0100b The same procedure is done for all 4-bit groups yielding the bit counts for each of the four bytes (line 8) in their lower four bits. That means, each byte contains its bit count, however, the upper four bits may contain junk and are masked out (line 10). Multiplying by 0x01010101 has an interesting property if we name the four bytes A, B, C, D: A, B, C, D → A+B+C+D, B+C+D, C+D, D Obviously the highest byte is what we are looking for. The right shift (line 12) returns just it. countBitsLoop (line 15 − line 25): A very elegant and sometimes faster algorithm was presented in Brian Kernighan's and Dennis Ritchie's famous book "The C Programming Language". On the internet I found out that Peter Wegener actually deserves the credit for first publishing it. Unlike the other algorithms, its performance heavily depends on its input. When less than about four bits are set, it outperforms all other algorithms but suffers heavily when almost all bits are set. The main trick, stripping a single bit with x &= x − 1 (line 21), deserves some attention: • if x = 0, then the while-loop is not entered at all, so we do not need to consider this case • if the rightmost bit is 1, then the rightmost bit of x − 1 is 0. All other bits are identical and x &= x − 1 → x = x − 1. Because all other bits are identical we stripped one set bit, the rightmost bit. • if the rightmost bits are 0, then x looks like this: ...1000. And x − 1 looks like this: ...0111. Result of x &= x-1: ...0000. Hence, x &= x − 1 clears all bits except for the ones represented as dots, they remain the same. Again, exactly one bit was cleared. In general, x &= x − 1 always sets the rightmost bit which was 1 to 0.

Restrictions	• designed for 32 bits • except for countBitsLoop, which is fine for all integer bit widths

These ads help me to pay my server bills

Performance	+ Intel® Pentium™ D: • countBits: ≈ 23 cycles per number, constant time, data independent • countBitsLoop: ≈ 68 cycles per number (average) • countBitsLoop(0): ≈ 4 cycles per number when 0 bits are set (0% are set) • countBitsLoop(16): ≈ 54 cycles per number when 16 bits are set (50% are set) • countBitsLoop(32): ≈ 145 cycles per number when 32 bits are set (100% are set) • simple (unrolled for-loop): ≈ 60 cycles per number, constant time, data independent + Intel® Core™ 2: • countBits: ≈ 16.5 cycles per number, constant time, data independent • countBitsLoop: ≈ 37.5 cycles per number (average) • countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set) • countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set) • countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set) • simple (unrolled for-loop): ≈ 77 cycles per number, constant time, data independent + Intel® Core™ i7: • countBits: ≈ 16 cycles per number, constant time, data independent • countBitsLoop: ≈ 38 cycles per number (average) • countBitsLoop(0): ≈ 2.5 cycles per number when 0 bits are set (0% are set) • countBitsLoop(16): ≈ 37 cycles per number when 16 bits are set (50% are set) • countBitsLoop(32): ≈ 70 cycles per number when 32 bits are set (100% are set) • simple (unrolled for-loop): ≈ 67 cycles per number, constant time, data independent CPU cycles (full optimization, lower values are better)

Assembler Output	countBits: 01: ; x = x - ((x >> 1) & 0x55555555) 02: mov ecx, eax 03: shr ecx, 1 04: and ecx, 55555555h 05: sub eax, ecx 06: ; x = (x & 0x33333333) + ((x >> 2) & 0x33333333) 07: mov ecx, eax 08: and eax, 33333333h 09: shr ecx, 2 10: and ecx, 33333333h 11: add ecx, eax 12: ; x = x + (x >> 4) 13: mov eax, ecx 14: shr eax, 4 15: add eax, ecx 16: ; x &= 0xF0F0F0F 17: and eax, 0F0F0F0Fh 18: ; (x * 0x01010101) >> 24 19: imul eax, 01010101h 20: shr eax, 24 countBitsLoop: 01: ; result = 0 02: xor eax, eax 03: ; while (x != 0) 04: test ecx, ecx 05: je $finish 06: $while: 07: ; x &= x-1 08: lea edx, [ecx-1] ; clever x86 trick: edx = ecx-1 09: and ecx, edx 10: ; result++ 11: inc eax 12: test ecx, ecx 13: jne $while 14: $finish: bits.stephan-brumme.com

Download	The full source code including a test program is available for download.

References	AMD, "AMD Software Optimization Guide for AMD64 Processors" Peter Wegner: "A Technique for Counting Ones in a Binary Computer", Communications of the ACM, Volume 3 (1960) Number 5, page 322

More Twiddled Bits	Absolute value of a float Absolute value of an integer Approximative inverse of a float Bit manipulation basics Bit mask of lowest bit not set Count bits set in parallel a.k.a. Population Count Detects zero bytes inside a 32 bit integer Endianess Extend bit width Float inverse square root approximation Float square root approximation Is power of two Minimum / maximum of integers Parity Position of lowest bit set Round up to the next power of two Sign comparison Sign of a 32 bit integer Swap two values Extra: Javascript bit manipulator ... or go to the index