I need a fast way to get the positions of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4, ...
Using char won't increase speed; in fact it often requires extra ANDing and sign/zero extension during calculation. Smaller integer types are only worth using for very large arrays, where keeping the data in cache matters.
Another thing you can improve is the COPY macro. Instead of copying byte by byte, copy a whole word at a time where possible:
#include <stdint.h>

static inline void COPY(unsigned char *dst, unsigned char *src, int n)
{
    switch (n) { // remember to align dst and src when declaring them
    case 8:
        *((int64_t*)dst) = *((int64_t*)src);
        break;
    case 7:
        *((int32_t*)dst) = *((int32_t*)src);
        *((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
        dst[6] = src[6];
        break;
    case 6:
        *((int32_t*)dst) = *((int32_t*)src);
        *((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
        break;
    case 5:
        *((int32_t*)dst) = *((int32_t*)src);
        dst[4] = src[4];
        break;
    case 4:
        *((int32_t*)dst) = *((int32_t*)src);
        break;
    case 3:
        *((int16_t*)dst) = *((int16_t*)src);
        dst[2] = src[2];
        break;
    case 2:
        *((int16_t*)dst) = *((int16_t*)src);
        break;
    case 1:
        dst[0] = src[0];
        break;
    case 0:
        break;
    }
}
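A hedged alternative sketch: if both the source table and the destination happen to have at least 8 bytes of slack past the copied region (an assumption, not something stated here), you can always copy a full 8-byte word and let the caller advance dst by only n, so the surplus bytes are simply overwritten by the next copy. A fixed-size memcpy compiles to a single load/store on mainstream compilers and avoids the alignment/aliasing caveats of the casts above:

#include <stdint.h>
#include <string.h>

static inline void COPY8(unsigned char *dst, const unsigned char *src)
{
    uint64_t w;
    memcpy(&w, src, sizeof w);   // one 8-byte load
    memcpy(dst, &w, sizeof w);   // one 8-byte store
}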
Also, since tabofs[x] and n[x] are usually accessed right after each other, put them next to each other in memory so they are always in the cache at the same time:
struct TAB_N
{
    int16_t n, tabofs;
} tab_n[256];
src = tab0 + tab_n[b0].tabofs; COPY(dst, src, tab_n[b0].n); dst += tab_n[b0].n;
src = tab0 + tab_n[b1].tabofs; COPY(dst, src, tab_n[b1].n); dst += tab_n[b1].n;
src = tab0 + tab_n[b2].tabofs; COPY(dst, src, tab_n[b2].n); dst += tab_n[b2].n;
src = tab0 + tab_n[b3].tabofs; COPY(dst, src, tab_n[b3].n); dst += tab_n[b3].n;
src = tab0 + tab_n[b4].tabofs; COPY(dst, src, tab_n[b4].n); dst += tab_n[b4].n;
src = tab0 + tab_n[b5].tabofs; COPY(dst, src, tab_n[b5].n); dst += tab_n[b5].n;
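For completeness, here is a rough sketch of how tab0 and tab_n could be filled; this is my reading of the scheme rather than code from the question: tab0 stores the set-bit positions of every byte value back to back, tab_n[v].n is the popcount of v, and tab_n[v].tabofs is where v's positions start. Positions copied for the higher bytes b1..b5 still need to be biased by 8, 16, ..., either by keeping per-byte copies of the table or by adjusting after the copy.

#include <stdint.h>

static unsigned char tab0[1024];                        // 256 byte values have 1024 set bits in total
static struct TAB_N { int16_t n, tabofs; } tab_n[256];  // mirrors the struct declared above

static void init_tables(void)
{
    int ofs = 0;
    for (int v = 0; v < 256; v++) {
        tab_n[v].tabofs = (int16_t)ofs;
        tab_n[v].n = 0;
        for (int bit = 0; bit < 8; bit++) {
            if (v & (1 << bit)) {
                tab0[ofs++] = (unsigned char)bit;       // bit position within the byte
                tab_n[v].n++;
            }
        }
    }
}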
Last but not least, gettimeofday is not meant for performance measurement. Use QueryPerformanceCounter instead; it's much more precise.
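A minimal timing sketch, assuming the benchmark runs on Windows (on POSIX, clock_gettime(CLOCK_MONOTONIC, ...) would be the comparable replacement for gettimeofday):

#include <windows.h>

static double elapsed_us(void (*fn)(void))
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // ticks per second
    QueryPerformanceCounter(&t0);
    fn();                               // code under test
    QueryPerformanceCounter(&t1);
    return (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;
}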
Your code uses a 1-byte index table (256 entries). You can speed it up by a factor of 2 by using a 2-byte index table (65536 entries).
Unfortunately, you probably cannot extend that further: for 3 bytes the table would be about 16 MB, unlikely to fit in the CPU caches, and it would only make things slower.
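A rough sketch of the 2-byte-table idea, with my own assumed layout (pos16 holds the set-bit positions of every 16-bit value and cnt16 their counts, roughly 1 MB together):

#include <stdint.h>

static unsigned char pos16[65536][16];   // set-bit positions of each 16-bit value
static unsigned char cnt16[65536];       // popcount of each 16-bit value

static void init16(void)
{
    for (unsigned v = 0; v < 65536; v++) {
        unsigned char n = 0;
        for (unsigned b = 0; b < 16; b++)
            if (v & (1u << b))
                pos16[v][n++] = (unsigned char)b;
        cnt16[v] = n;
    }
}

static int bit_positions16(uint64_t x, unsigned char *idx)
{
    unsigned char *dst = idx;
    for (int k = 0; k < 64; k += 16) {                 // four 16-bit chunks
        unsigned chunk = (unsigned)(x >> k) & 0xFFFFu;
        for (int i = 0; i < cnt16[chunk]; i++)
            *dst++ = (unsigned char)(pos16[chunk][i] + k);
    }
    return (int)(dst - idx);
}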
Has this been found to be too slow?
Small and crude, but it's all in the cache and CPU registers:
void mybits(uint64_t x, unsigned char *idx)
{
    unsigned char n = 0;
    do {
        if (x & 1) *(idx++) = n;
        n++;
    } while (x >>= 1);          // note: with a negative signed x this would never terminate
    *idx = (unsigned char) 255; // list terminator (a real position is at most 63)
}
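Usage would look something like this (a hypothetical driver, assuming mybits from above is in scope; the 255 terminator works because a real position can never exceed 63):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned char idx[65];      // up to 64 positions plus the terminator
    mybits(123703ULL, idx);
    for (unsigned char *p = idx; *p != 255; p++)
        printf("%u ", *p);
    printf("\n");
    return 0;
}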
It's still 3 times faster to unroll the loop and produce an array of 64 true/false values (which isn't quite what's wanted):
void mybits_3_2(uint64_t x, idx_type idx[])
{
#define SET(i) (idx[(i)] = ((x >> (i)) & 1)) // shift x (64-bit) instead of 1UL, which would overflow for i >= 32
SET( 0);
SET( 1);
SET( 2);
SET( 3);
...
SET(63);
}
Assuming sparsity in the number of set bits:

int count = 0;
uint64_t tmp_bitmap = x;
while (tmp_bitmap != 0) {
    int next_psn = __builtin_ffsll(tmp_bitmap) - 1; // index of the lowest set bit
    tmp_bitmap &= (tmp_bitmap - 1);                 // clear that bit
    id[count++] = next_psn;
}
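The same idea packaged as a function, using __builtin_ctzll (a GCC/Clang builtin) to get the index of the lowest set bit directly; the function name here is mine, purely for illustration:

#include <stdint.h>

static int bit_positions_sparse(uint64_t x, unsigned char *idx)
{
    int count = 0;
    while (x) {
        idx[count++] = (unsigned char)__builtin_ctzll(x);  // index of lowest set bit
        x &= x - 1;                                        // clear it
    }
    return count;
}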
The question is what are you going to do with the collection of positions?
If you have to iterate over them many times, then yes, it can be worth gathering them once as you are doing now and iterating over the array many times.
But if you only iterate once or a few times, you can skip building an intermediate array of positions and simply invoke a processing closure/function on each 1 bit encountered while iterating over the bits.
Here is a naive example of a bit iterator I wrote in Smalltalk:
LargePositiveInteger>>bitsDo: aBlock
| mask offset |
1 to: self digitLength do: [:iByte |
offset := (iByte - 1) << 3.
mask := (self digitAt: iByte).
[mask = 0]
whileFalse:
[aBlock value: mask lowBit + offset.
mask := mask bitAnd: mask - 1]]
A LargePositiveInteger is an Integer of arbitrary length composed of byte digits. lowBit answers the rank of the lowest set bit (1-based) and is implemented as a lookup table with 256 entries.
In C++11 you can easily pass a closure, so it should be easy to translate. Here is a rough equivalent in plain C, using a function pointer and a 4-bit lowest-bit table:
uint64_t x;
unsigned int mask;
void (*process_bit_position)(unsigned int);
unsigned char offset = 0;
unsigned char lowBitTable[16] = {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0}; // 0-based, first entry is unused

while (x)
{
    mask = x & 0xFUL;
    while (mask)
    {
        process_bit_position( lowBitTable[mask] + offset );
        mask &= mask - 1;
    }
    offset += 4;
    x >>= 4;
}
The example is demonstrated with a 4-bit table, but you can easily extend it to 13 bits or more if it still fits in cache.
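And a hedged C++11 sketch of the same loop taking a closure; the name for_each_set_bit is my own, nothing standard:

#include <stdint.h>

template <typename F>
void for_each_set_bit(uint64_t x, F process_bit_position)
{
    static const unsigned char lowBitTable[16] =
        {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0};       // lowest set bit of a nibble
    unsigned offset = 0;
    while (x) {
        unsigned mask = (unsigned)(x & 0xFu);
        while (mask) {
            process_bit_position(lowBitTable[mask] + offset);
            mask &= mask - 1;                     // clear the lowest set bit
        }
        offset += 4;
        x >>= 4;
    }
}

// usage: for_each_set_bit(x, [&](unsigned pos) { idx[count++] = pos; });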
For better branch prediction, the inner loop could be rewritten as a for (i = 0; i < nbit; i++) with an additional table lookup nbit = numBitTable[mask], then unrolled with a switch (the compiler might do that for you), but I'll let you measure how the simple version performs first...
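A sketch of that variant; numBitTable (a nibble popcount table) is my addition, the rest follows the example above:

#include <stdint.h>

static const unsigned char numBitTable[16] =
    {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};            // popcount of a nibble
static const unsigned char lowBitTable[16] =
    {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0};            // lowest set bit of a nibble

void for_each_set_bit_counted(uint64_t x, void (*process_bit_position)(unsigned))
{
    unsigned offset = 0;
    while (x) {
        unsigned mask = (unsigned)(x & 0xFu);
        unsigned nbit = numBitTable[mask];
        for (unsigned i = 0; i < nbit; i++) {     // trip count is known up front
            process_bit_position(lowBitTable[mask] + offset);
            mask &= mask - 1;
        }
        offset += 4;
        x >>= 4;
    }
}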
Here's something very simple which might be faster - no way to know without testing. Much will depend on the number of bits set versus the number unset. You could unroll this to remove the branching altogether, but with today's processors I don't know whether it would speed things up at all.
unsigned char idx[K+1]; // need one extra for overwrite protection
unsigned char *dst = idx;
for (unsigned char i = 0; i < 50; i++)
{
    *dst = i;        // speculatively write the position
    dst += x & 1;    // keep it only if the bit was set
    x >>= 1;
}
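As a usage note (K is whatever bound the question uses for idx; that part is assumed here), the number of positions written falls out of the pointer difference, and the extra slot absorbs the final speculative write:

// after the loop:
int count = (int)(dst - idx);   // number of set-bit positions recorded
// idx[count] may hold a leftover speculative value; ignore it.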
P.S. your sample output in the question is wrong, see http://ideone.com/2o032E