# Position of lowest bit set

Code
``` 01: const unsigned int MultiplyDeBruijnBitPosition = 02: { 03:   // precomputed lookup table 04:   0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8, 05:   31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9 06: }; 07: 08: unsigned int lowestBitSet(unsigned int x) 09: { 10:   // only lowest bit will remain 1, all others become 0 11:   x  &= -int(x); 12:   // DeBruijn constant 13:   x  *= 0x077CB531; 14:   // the upper 5 bits are unique, skip the rest (32 - 5 = 27) 15:   x >>= 27; 16:   // convert to actual position 17:   return MultiplyDeBruijnBitPosition[x]; 18: } 19: 20: unsigned int lowestBitSetSimple(unsigned int x) 21: { 22:   // x=0 is not properly handled by while-loop 23:   if (x == 0) 24:     return 0; 25: 26:   unsigned int result = 0; 27:   while ((x & 1) == 0) 28:   { 29:     x >>= 1; 30:     result++; 31:   } 32: 33:   return result; 34: } ``` bits.stephan-brumme.com

Explanation A negative number is usually represented by the two-complement: −x = not(x)+1

The formula x & −x = x & (not(x) + 1) deletes all but the lowest set bit (line 11).

That means after line 11 we have a power of 2 and there are obviously only 32 possibilities left.
When multiplied by the DeBruijn constant 0x077CB531 (line 13), the upper 5 bits are unique for each power of 2.
A small lookup table resolves the actual position of the lowest bit set (line 1 to 6, line 17).
The overall performance of this algorithm is mainly determined by the lookup table and
the numbers given below are based on full cache hits.

Note: DeBruijn sequences are extensively explained by Wikipedia, too.

The simple version performs surprisingly well for randomly distributed input:
50% of all integers are odd and their result will be 0. Another 25% result will be 1, therefore the loop terminates pretty early.
When numbers with long spans of zeros are processed, timings become significantly worse.
The observed break-even was lowestBitSet(32) on a Pentium D, your mileage may vary.

The fastest way on almost all modern x86/x64 CPU bases on the built-in BSF instruction.

Restrictions • designed for 32 bits

help me to pay
my server bills

Performance + Intel® Pentium™ D:
lowestBitSet: ≈ 18 cycles per number
• simple(1,2): ≈ 4 cycles per number
• simple(mix): ≈ 39 cycles per number
• BSF: ≈ 3 cycles per number

+ Intel® Core™ 2:
lowestBitSet: ≈ 10.5 cycles per number
• simple(1,2): ≈ 2.5 cycles per number
• simple(mix): ≈ 21 cycles per number
• BSF: ≈ 1 cycle per number

+ Intel® Core™ i7:
lowestBitSet: ≈ 10.5 cycles per number
• simple(1,2): ≈ 2.5 cycles per number
• simple(mix): ≈ 21 cycles per number
• BSF: ≈ 1 cycle per number CPU cycles
(full optimization,
lower values are better)

Assembler Output
``` lowestBitSet: 01:   ; x &= -x 02:   mov ecx, eax 03:   neg ecx 04:   and ecx, eax 05:   ; x *= 0x077CB531 06:   imul ecx, 077CB531h 07:   ; x >>= 27 08:   shr ecx, 27 09:   ; return MultiplyDeBruijnBitPosition[x]; 10:   mov eax, MultiplyDeBruijnBitPosition[ecx*4] lowestBitSetSimple: 01:   ; result = 0 02:   xor eax, eax 03:   ; if (x == 0) return 0 04:   test ecx, ecx 05:   je \$finish 06:   ; while ((x & 1) == 0) 07:   test cl, 1 08:   jne \$finish 09:   \$scanLoop: 10:   ; x >>= 1 11:   shr ecx, 1 12:   ; result++ 13:   inc eax 14:   test cl, 1 15:   je \$scanLoop 16:   \$finish: ``` bits.stephan-brumme.com