Instead of just the lowest set bit, I want to find the position of the nth lowest set bit. (I\'m NOT talking about value on the nt
Nowadays this is very easy with PDEP from the BMI2 instruction set. Here is a 64-bit version with some examples:
#include <cassert>
#include <cstdint>
#include <x86intrin.h>
inline uint64_t nthset(uint64_t x, unsigned n) {
return _pdep_u64(1ULL << n, x);
}
int main() {
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 0) ==
0b0000'0000'0000'0000'0000'0000'0010'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 1) ==
0b0000'0000'0000'0000'0000'0000'1000'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 3) ==
0b0000'0000'0000'0000'0100'0000'0000'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 9) ==
0b0000'1000'0000'0000'0000'0000'0000'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 10) ==
0b0000'0000'0000'0000'0000'0000'0000'0000);
}
It turns out that it is indeed possible to do this with no loops. It is fastest to precompute the (at least) 8 bit version of this problem. Of course, these tables use up cache space, but there should still be a net speedup in virtually all modern pc scenarios. In this code, n=0 returns the least set bit, n=1 is second-to-least, etc.
Solution with __popcnt
There is a solution using the __popcnt intrinsic (you need __popcnt to be extremely fast or any perf gains over a simple loop solution will be moot. Fortunately most SSE4+ era processors support it).
// lookup table for sub-problem: 8-bit v
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v < 256 and n < 8
ulong nthSetBit(ulong v, ulong n) {
ulong p = __popcnt(v & 0xFFFF);
ulong shift = 0;
if (p <= n) {
v >>= 16;
shift += 16;
n -= p;
}
p = __popcnt(v & 0xFF);
if (p <= n) {
shift += 8;
v >>= 8;
n -= p;
}
if (n >= 8) return 0; // optional safety, in case n > # of set bits
return PRECOMP[v & 0xFF][n] << shift;
}
This illustrates how the divide and conquer approach works.
General Solution
There is also a solution for "general" architectures- without __popcnt. It can be done by processing in 8-bit chunks. You need one more lookup table that tells you the popcnt of a byte:
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v<256 and n < 8
byte POPCNT[256] = { ... } // POPCNT[v] is the number of set bits in v. (v < 256)
ulong nthSetBit(ulong v, ulong n) {
ulong p = POPCNT[v & 0xFF];
ulong shift = 0;
if (p <= n) {
n -= p;
v >>= 8;
shift += 8;
p = POPCNT[v & 0xFF];
if (p <= n) {
n -= p;
shift += 8;
v >>= 8;
p = POPCNT[v & 0xFF];
if (p <= n) {
n -= p;
shift += 8;
v >>= 8;
}
}
}
if (n >= 8) return 0; // optional safety, in case n > # of set bits
return PRECOMP[v & 0xFF][n] << shift;
}
This could, of course, be done with a loop, but the unrolled form is faster and the unusual form of the loop would make it unlikely that the compiler could automatically unroll it for you.
I cant see a method without a loop, what springs to mind would be;
int set = 0;
int pos = 0;
while(set < n) {
if((bits & 0x01) == 1) set++;
bits = bits >> 1;
pos++;
}
after which, pos would hold the position of the nth lowest-value set bit.
The only other thing that I can think of would be a divide and conquer approach, which might yield O(log(n)) rather than O(n)...but probably not.
Edit: you said any behaviour, so non-termination is ok, right? :P
v-1 has a zero where v has its least significant "one" bit, while all more significant bits are the same. This leads to the following function:
int ffsn(unsigned int v, int n) {
for (int i=0; i<n-1; i++) {
v &= v-1; // remove the least significant bit
}
return v & ~(v-1); // extract the least significant bit
}
Building on the answer given by Jukka Suomela, which uses a machine-specific instruction that may not necessarily be available, it is also possible to write a function that does exactly the same thing as _pdep_u64 without any machine dependencies. It must loop over the set bits in one of the arguments, but can still be described as a constexpr function for C++11.
constexpr inline uint64_t deposit_bits(uint64_t x, uint64_t mask, uint64_t b, uint64_t res) {
return mask != 0 ? deposit_bits(x, mask & (mask - 1), b << 1, ((x & b) ? (res | (mask & (-mask))) : res)) : res;
}
constexpr inline uint64_t nthset(uint64_t x, unsigned n) {
return deposit_bits(1ULL << n, x, 1, 0);
}
My answer is mostly based on this implementation of a 64bit word select method (Hint: Look only at the MARISA_USE_POPCNT, MARISA_X64, MARISA_USE_SSE3 codepaths):
It works in two steps, first selecting the byte containing the n-th set bit and then using a lookup table inside the byte:
n to all bytes (SSE broadcast or again multiplication with 0x01010101...)n)Now we know which byte contains the bit and a simple byte lookup table like in grek40's answer suffices to get the result.
Note however that I have not really benchmarked this result against other implementations, only that I have seen it to be quite efficient (and branchless)