I need a fast way to get the position of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4,
I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.
Judging by your sample code and your previous SO question, you are enumerating all words with K bits set in order and extracting the bit indices from each one. This greatly simplifies matters.
If so, then instead of rebuilding the bit positions on every iteration, try incrementing the positions in the bit array directly. Half of the time this will involve a single loop iteration and increment.
Something along these lines:
#include <stddef.h>
#include <stdio.h>
void process(const unsigned int *bits, size_t num);
// Walk through all len-bit words with num-bits set in order
// (assumes 1 <= num <= len <= 64)
void enumerate(size_t num, size_t len) {
    size_t i;
    unsigned int bitpos[64 + 1];
    // Seed with the lowest word plus a sentinel
    for(i = 0; i < num; ++i)
        bitpos[i] = i;
    bitpos[i] = 0;
    // Here goes the main loop
    do {
        // Do something with the resulting data
        process(bitpos, num);
        // Increment the least-significant series of consecutive bits
        for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = i;
    // Stop on reaching the top
    } while(++bitpos[i] != len);
}
// Test function: prints the bit positions, highest first
void process(const unsigned int *bits, size_t num) {
    do
        printf("%u ", bits[--num]);
    while(num);
    putchar('\n');
}
Not particularly optimized but you get the general idea.
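To try the idea end to end, here is a minimal self-contained sketch of the same loop with a callback instead of a fixed `process`. The names `visit_fn`, `enumerate_positions`, `tally` and `tally_word` are hypothetical and exist only for this illustration, and the precondition 1 <= num <= len <= 64 is an assumption that the snippet above leaves implicit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: receives the set-bit positions in ascending order. */
typedef void (*visit_fn)(const unsigned *bitpos, size_t num, void *ctx);

/* Visit every len-bit word with exactly num bits set, smallest word first.
   Assumes 1 <= num <= len <= 64. */
static void enumerate_positions(size_t num, size_t len, visit_fn visit, void *ctx)
{
    unsigned bitpos[64 + 1];
    size_t i;

    assert(num >= 1 && num <= len && len <= 64);

    /* Seed with the lowest word, followed by a sentinel that can never
       continue a run of consecutive positions. */
    for (i = 0; i < num; ++i)
        bitpos[i] = (unsigned)i;
    bitpos[num] = 0;

    do {
        visit(bitpos, num, ctx);
        /* Walk up the lowest run of consecutive positions, resetting every
           element below its top to 0, 1, 2, ... */
        for (i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = (unsigned)i;
        /* Then move the top of that run up by one; stop once it leaves the word. */
    } while (++bitpos[i] != len);
}

/* Hypothetical visitor: counts the words and checks that they arrive in
   strictly increasing order. */
struct tally { size_t count; uint64_t last; int ordered; };

static void tally_word(const unsigned *bitpos, size_t num, void *ctx)
{
    struct tally *t = ctx;
    uint64_t word = 0;
    for (size_t k = 0; k < num; ++k)
        word |= UINT64_C(1) << bitpos[k];
    if (t->count != 0 && word <= t->last)
        t->ordered = 0;
    t->count++;
    t->last = word;
}
```

Calling `enumerate_positions(2, 4, tally_word, &t)` with `t = {0, 0, 1}` should visit the 6 words 0011, 0101, 0110, 1001, 1010, 1100 in that order, so `t.count` ends at 6 and `t.last` at 12.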
Here's some tight code, written for one byte (8 bits), but it extends in the obvious way to 64 bits.
#include <stdio.h>
int main(void)
{
    int x = 187;
    int ans[8] = {-1,-1,-1,-1,-1,-1,-1,-1};
    int idx = 0;
    while (x)
    {
        switch (x & ~(x-1))   // isolate the lowest set bit
        {
            case 0x01: ans[idx++] = 0; break;
            case 0x02: ans[idx++] = 1; break;
            case 0x04: ans[idx++] = 2; break;
            case 0x08: ans[idx++] = 3; break;
            case 0x10: ans[idx++] = 4; break;
            case 0x20: ans[idx++] = 5; break;
            case 0x40: ans[idx++] = 6; break;
            case 0x80: ans[idx++] = 7; break;
        }
        x &= x-1;             // clear the lowest set bit
    }
    getchar();
    return 0;
}
Output array should be:
ans = {0,1,3,4,5,7,-1,-1};
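One way to carry the same lowest-set-bit loop over to 64 bits without writing a 64-case `switch` is to let the compiler find the index of the isolated bit. This swaps the `switch` for a count-trailing-zeros operation, so it is a variation on the code above rather than a literal expansion of it. The sketch assumes GCC or Clang, whose `__builtin_ctzll` counts trailing zeros; other compilers fall back to a plain shift loop. The function name `bit_positions` is hypothetical.

```c
#include <stdint.h>

/* Write the positions of the set bits of x into out[], lowest first, and
   return how many were written. out must have room for 64 entries.
   bit_positions is a hypothetical name used for this sketch. */
static int bit_positions(uint64_t x, int out[64])
{
    int n = 0;
    while (x) {
#if defined(__GNUC__) || defined(__clang__)
        /* Assumption: GCC/Clang builtin. x is nonzero here, as it requires. */
        out[n++] = __builtin_ctzll(x);
#else
        /* Portable fallback: isolate the lowest set bit, then count how far
           it sits from bit 0. */
        uint64_t low = x & (~x + 1);
        int pos = 0;
        while (low >>= 1)
            ++pos;
        out[n++] = pos;
#endif
        x &= x - 1;   /* clear the lowest set bit */
    }
    return n;
}
```

For `x = 187` this should fill `out` with 0, 1, 3, 4, 5, 7 and return 6, matching the 8-bit result above.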
As a minimal modification:
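#include <stdint.h>
#include <string.h>
uint64_t x;      // unsigned, so the right shift below never sign-extends
char idx[K+1];   // K is the number of set bits in x
char *dst = idx;
const int BITS = 8;
for (int i = 0; i < 64; i += BITS) {
    int y = (int)(x & ((1u << BITS) - 1));
    strcpy(dst, tab[y]);               // tab[y] is a _string_
    char *end = dst + strlen(tab[y]);  // strcat/strcpy return dst, not the end
    for (; dst != end; ++dst)
    {
        *dst += (i - 1); // tab[] is null-terminated so bit positions are 1 to BITS.
    }
    x >>= BITS;
}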
The choice of BITS determines the size of the table. 8, 13 and 16 are logical choices. Each entry is a string, zero-terminated and containing bit positions with 1 offset. I.e. tab[5] is "\x03\x01". The inner loop fixes this offset.
Slightly more efficient: replace the strcat and inner loop by
char const* ptr = tab[y];
while (*ptr)
{
    *dst++ = *ptr++ + (i-1);
}
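The answer does not show how `tab[]` is filled, so here is one possible construction for `BITS = 8`, sketched under the layout described above: each entry is a NUL-terminated string of 1-based positions, highest first. The names `build_tab` and `extract_positions` are hypothetical, and the table is stored as fixed-size rows rather than as pointers to strings, purely for brevity.

```c
#include <stdint.h>

enum { BITS = 8 };

/* tab[y] lists the set bits of the byte y as 1-based positions, highest
   first, NUL-terminated; e.g. tab[5] is "\x03\x01". */
static char tab[1 << BITS][BITS + 1];

/* Hypothetical helper: fill tab[] once before the first extraction. */
static void build_tab(void)
{
    for (int y = 0; y < (1 << BITS); ++y) {
        char *p = tab[y];
        for (int b = BITS - 1; b >= 0; --b)
            if (y & (1 << b))
                *p++ = (char)(b + 1);
        *p = '\0';
    }
}

/* Hypothetical helper: store the positions of the set bits of x in idx[]
   (room for 64 entries) and return the count. Positions ascend from chunk
   to chunk but descend within each chunk, following the table layout. */
static int extract_positions(uint64_t x, char idx[64])
{
    char *dst = idx;
    for (int i = 0; i < 64; i += BITS) {
        const char *ptr = tab[x & ((1u << BITS) - 1)];
        while (*ptr)
            *dst++ = (char)(*ptr++ + (i - 1));   /* undo the 1-based offset */
        x >>= BITS;
    }
    return (int)(dst - idx);
}
```

After `build_tab()`, calling `extract_positions(123703, idx)` should return 11 and fill `idx` with 5, 4, 2, 1, 0, 15, 14, 13, 9, 8, 16. If the caller needs fully ascending output, the table entries could be stored lowest first instead, at the cost of the suffix-sharing property discussed next.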
Loop unrolling can be a bit of a pain if the loop contains branches, because copying those branch statements doesn't help the branch predictor. I'll happily leave that decision to the compiler.
One thing I'm considering is that tab[] is an array of pointers to strings. These are highly similar: "\x1" is a suffix of "\x3\x1". In fact, each string which doesn't start with "\x8" is a suffix of a string which does. I'm wondering how many unique strings you need, and to what degree tab[y] is in fact needed. E.g. by the logic above, tab[128+x] == tab[x]-1.
[edit]
Never mind, you definitely need 128 tab entries starting with "\x8" since they're never the suffix of another string. Still, the tab[128+x] == tab[x]-1 rule means that you can save half the entries, but at the cost of two extra instructions: char const* ptr = tab[x & 0x7F] - ((x>>7) & 1). (Set up tab[] to point after the \x8.)
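The halved table from the edit can be sketched as follows. Each of the 128 seven-bit entries is stored with a leading `\x8` byte, and the lookup steps back one byte onto that prefix when bit 7 of the index is set. The names `half`, `build_half` and `lookup` are hypothetical, and fixed-size rows stand in for the pointer array, so `half[x] + 1` plays the role of `tab[x]`.

```c
/* half[x] holds "\x08" followed by the string for the 7-bit value x:
   1-based positions of its set bits, highest first, NUL-terminated.
   Worst case is 1 prefix byte + 7 positions + NUL = 9 bytes. */
static char half[128][9];

/* Hypothetical helper: fill half[] once before the first lookup. */
static void build_half(void)
{
    for (int x = 0; x < 128; ++x) {
        char *p = half[x];
        *p++ = 8;   /* prefix that turns the entry for x into the one for 128 + x */
        for (int b = 6; b >= 0; --b)
            if (x & (1 << b))
                *p++ = (char)(b + 1);
        *p = '\0';
    }
}

/* String for any byte y: start just past the "\x08" prefix, then step back
   onto it when bit 7 of y is set. */
static const char *lookup(unsigned y)
{
    return half[y & 0x7F] + 1 - ((y >> 7) & 1);
}
```

With this layout `lookup(5)` yields "\x03\x01" and `lookup(133)` (that is, 128 + 5) yields "\x08\x03\x01" from the same row, which is the sharing the rule describes.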
If I take "I need a fast way to get the position of all one bits in a 64-bit integer" literally...
I realise this is a few weeks old; however, out of curiosity: I remember from my assembly days on the CBM64 and Amiga using an arithmetic shift and then examining the carry flag. If it's set, the shifted-out bit was 1; if clear, it was 0.
e.g. for an arithmetic shift left (examining from bit 63 down to bit 0)....
Pseudocode (ignore instruction mix errors and oversimplification; it's been a while):
          move #64, counter
loop.     ASL 64bitinteger
          BCS carryset
decctr.   dec counter
          bne loop
          exit
carryset.
          //store #counter-1 (i.e. bit position) in datastruct indexed by counter
          jmp decctr
...I hope you get the idea.
I've not used assembly since then but I'm wondering if we could use some C++ in-line assembly similar to the above to do something similar here. We could do the whole conversion in assembly (very few lines of code), building up an appropriate data structure. C++ could simply examine the answer.
If this is possible then I'd imagine it to be pretty fast.
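The same shift-and-test idea can also be written in plain C without inline assembly, by testing the top bit before each left shift instead of reading the carry flag afterwards. This is only a sketch of the loop's logic, with no claim about its speed relative to the other answers; the function name `positions_by_shift` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: record the positions of the set bits of x, highest
   first, in out[] (room for 64 entries) and return the count. */
static int positions_by_shift(uint64_t x, int out[64])
{
    int n = 0;
    for (int counter = 64; counter != 0; --counter) {
        /* The top bit is the one a left shift would move into the carry. */
        if (x & (UINT64_C(1) << 63))
            out[n++] = counter - 1;
        x <<= 1;
    }
    return n;
}
```

For `x = 187` this should return 6 with `out` holding 7, 5, 4, 3, 1, 0. Note that it always runs 64 iterations regardless of how many bits are set, unlike the loops that clear one set bit per pass.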