Question
Update: please read the code; it is NOT about counting bits in one int.
Is it possible to improve the performance of the following code with some clever assembler?
uint bit_counter[64];

void Count(uint64 bits) {
    bit_counter[0] += (bits >> 0) & 1;
    bit_counter[1] += (bits >> 1) & 1;
    // ..
    bit_counter[63] += (bits >> 63) & 1;
}
Count is in the innermost loop of my algorithm.
Update: Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.
The bits variable has almost random bits (close to half zeros and half ones).
Answer 1:
Maybe you can do 8 at once, by taking 8 bits spaced 8 apart and keeping 8 uint64s for the counts. That's only 1 byte per single counter though, so you can only accumulate 255 invocations of Count before you'd have to unpack those uint64s.
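A minimal sketch of this scheme (the names acc and flush, and the flush-every-255-calls cadence, are my additions, not from the answer):

```cpp
#include <cstdint>

static uint64_t acc[8];          // acc[i] packs 8 byte-wide counters: byte j counts bit j*8 + i
static uint32_t bit_counter[64];
static int pending = 0;

// Unpack the byte-wide counters into bit_counter[] and reset them.
void flush() {
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 8; ++j)
            bit_counter[j * 8 + i] += (acc[i] >> (8 * j)) & 0xFF;
        acc[i] = 0;
    }
    pending = 0;
}

// Accumulate one 64-bit sample; each byte lane holds at most 255 before it would overflow.
void count(uint64_t bits) {
    for (int i = 0; i < 8; ++i)
        acc[i] += (bits >> i) & 0x0101010101010101ULL;
    if (++pending == 255)
        flush();
}
```

The inner loop is 8 masked adds per input word instead of 64 increments, at the cost of the periodic unpack.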
Answer 2:
You could try doing it with SSE, incrementing 4 elements per iteration.
Warning: untested code follows...
#include <stdint.h>
#include <emmintrin.h>

uint32_t bit_counter[64] __attribute__ ((aligned(16)));
// make sure bit_counter array is 16 byte aligned for SSE

void Count_SSE(uint64_t bits)
{
    const __m128i inc_table[16] = {
        _mm_set_epi32(0, 0, 0, 0),
        _mm_set_epi32(0, 0, 0, 1),
        _mm_set_epi32(0, 0, 1, 0),
        _mm_set_epi32(0, 0, 1, 1),
        _mm_set_epi32(0, 1, 0, 0),
        _mm_set_epi32(0, 1, 0, 1),
        _mm_set_epi32(0, 1, 1, 0),
        _mm_set_epi32(0, 1, 1, 1),
        _mm_set_epi32(1, 0, 0, 0),
        _mm_set_epi32(1, 0, 0, 1),
        _mm_set_epi32(1, 0, 1, 0),
        _mm_set_epi32(1, 0, 1, 1),
        _mm_set_epi32(1, 1, 0, 0),
        _mm_set_epi32(1, 1, 0, 1),
        _mm_set_epi32(1, 1, 1, 0),
        _mm_set_epi32(1, 1, 1, 1)
    };

    for (int i = 0; i < 64; i += 4)
    {
        __m128i vbit_counter = _mm_load_si128((__m128i *)&bit_counter[i]);
                                             // load 4 ints from bit_counter
        int index = (bits >> i) & 15;        // get next 4 bits
        __m128i vinc = inc_table[index];     // look up 4 increments from LUT
        vbit_counter = _mm_add_epi32(vbit_counter, vinc);
                                             // increment 4 elements of bit_counter
        _mm_store_si128((__m128i *)&bit_counter[i], vbit_counter);
                                             // store 4 updated ints
    }
}
How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per loop iteration instead of 1. So we now have 16 loop iterations instead of 64. For each iteration we load 4 bits from bits, then use them as an index into a LUT which contains all 16 possible combinations of increments for the current 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.
The number of loads and stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speed up though. I'd be interested to know the result if you do decide to try it.
Answer 3:
Look at Bit Twiddling Hacks
- Counting bits set
- Counting bits set, naive way
- Counting bits set by lookup table
- Counting bits set, Brian Kernighan's way
- Counting bits set in 12, 24, or 32-bit words using 64-bit instructions
- Counting bits set, in parallel
- Count bits set (rank) from the most-significant bit upto a given position
- Select the bit position (from the most-significant bit) with the given count (rank)
Edit: As for the 'bit position bucket accumulation' (bit_counter[]), I have a feeling that this might be a good case for valarrays + masking. That'd be a fair bit of coding + testing + profiling though. Let me know if you are really interested.
You could, these days, come very close to valarray behaviour using tied tuples (TR1, boost or C++11); I have a feeling it would come out being simpler to read and slower to compile.
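A hypothetical sketch of the valarray idea above (the answerer never spelled it out; the shape below is my guess at what they meant):

```cpp
#include <valarray>
#include <cstdint>

// 64 zero-initialized counters; valarray makes the per-word update one
// element-wise += that the compiler is free to vectorize.
std::valarray<uint32_t> bit_counter(64);

void count(uint64_t bits) {
    std::valarray<uint32_t> mask(64);
    for (int i = 0; i < 64; ++i)
        mask[i] = (bits >> i) & 1;   // 0/1 mask per bit position
    bit_counter += mask;             // increment all 64 counters at once
}
```

Building the mask still costs 64 scalar operations per word, so this mainly buys readability, not speed.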
Answer 4:
Apparently this can be done quickly with "vertical counters". From the now-defunct page on Bit tricks (archive) by @steike:
Consider a normal array of integers, where we read the bits horizontally:
      msb<-->lsb
x[0]  00000010  = 2
x[1]  00000001  = 1
x[2]  00000101  = 5
A vertical counter stores the numbers, as the name implies, vertically; that is, a k-bit counter is stored across k words, with a single bit in each word.
x[0]  00000110   lsb
x[1]  00000001    |
x[2]  00000100    |
x[3]  00000000    |
x[4]  00000000   msb
           512
(the low three bit columns, read vertically, hold the counts 5, 1 and 2)
With the numbers stored like this, we can use bitwise operations to increment any subset of them all at once.
We create a bitmap with a 1 bit in the positions corresponding to the counters we want to increment, and loop through the array from LSB up, updating the bits as we go. The "carries" from one addition becomes the input for the next element of the array.
input    sum
A B      C S
0 0      0 0
0 1      0 1     sum   = a ^ b
1 0      0 1     carry = a & b
1 1      1 0

carry = input;
long *p = buffer;
while (carry) {
    a = *p;
    b = carry;
    *p++ = a ^ b;
    carry = a & b;
}
For 64-bit words the loop will run 6-7 times on average -- the number of iterations is determined by the longest chain of carries.
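A plain-C sketch of the whole scheme (the names vcount, vertical_add and vertical_flush are mine; with 8-bit-deep vertical counters you must flush to bit_counter[] every 255 additions):

```cpp
#include <cstdint>

const int DEPTH = 8;                 // 8-bit vertical counters: flush every 255 adds
static uint64_t vcount[DEPTH];       // bit i of vcount[k] = bit k of counter i
static uint32_t bit_counter[64];

// Add 1 to every counter whose bit is set in `bits` (ripple-carry add across words).
void vertical_add(uint64_t bits) {
    uint64_t carry = bits;
    for (int k = 0; k < DEPTH && carry; ++k) {
        uint64_t a = vcount[k];
        vcount[k] = a ^ carry;       // sum   = a ^ b
        carry     = a & carry;       // carry = a & b
    }
}

// Unpack the 64 vertical counters into bit_counter[] and reset them.
void vertical_flush() {
    for (int i = 0; i < 64; ++i) {
        uint32_t v = 0;
        for (int k = 0; k < DEPTH; ++k)
            v |= uint32_t((vcount[k] >> i) & 1) << k;
        bit_counter[i] += v;
    }
    for (int k = 0; k < DEPTH; ++k)
        vcount[k] = 0;
}
```

The hot path (vertical_add) touches at most 8 words per input, and the early-out on carry gives the 6-7 average iterations mentioned above.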
Answer 5:
You can unroll your function like this. It is probably faster than what your compiler can do!
// rax = 64 bit input
    xor rcx, rcx                                //clear addend
    add rax, rax                                //copy bit 63 to carry flag
    adc dword ptr [@bit_counter + 63 * 4], ecx  //add carry bit to counter[63]
    add rax, rax                                //copy bit 62 to carry flag
    adc dword ptr [@bit_counter + 62 * 4], ecx  //add carry bit to counter[62]
    add rax, rax                                //copy bit 61 to carry flag
    adc dword ptr [@bit_counter + 61 * 4], ecx  //add carry bit to counter[61]
    // ...
    add rax, rax                                //copy bit 1 to carry flag
    adc dword ptr [@bit_counter + 1 * 4], ecx   //add carry bit to counter[1]
    add rax, rax                                //copy bit 0 to carry flag
    adc dword ptr [@bit_counter], ecx           //add carry bit to counter[0]
EDIT:
You can also try a double increment, like this:
// rax = 64 bit input
    xor rcx, rcx                                //clear addend
    //
    add rax, rax                                //copy bit 63 to carry flag
    rcl rcx, 33                                 //move carry to bit 32, i.e. bit 0 of the upper uint32 counter
    add rax, rax                                //copy bit 62 to carry flag
    adc qword ptr [@bit_counter + 62 * 4], rcx  //add rcx to counters 63 and 62
    xor rcx, rcx                                //clear addend again (leftover bit would corrupt the next pair)
    add rax, rax                                //copy bit 61 to carry flag
    rcl rcx, 33                                 //move carry to bit 32, i.e. bit 0 of the upper uint32 counter
    add rax, rax                                //copy bit 60 to carry flag
    adc qword ptr [@bit_counter + 60 * 4], rcx  //add rcx to counters 61 and 60
    //...
Answer 6:
You could use a set of counters, each of a different size. First accumulate 3 values in 2-bit counters, then unpack them and update 4-bit counters. When 15 values are ready, unpack into byte-sized counters, and after 255 values update bit_counter[].
All of this work can be done in parallel in 128-bit SSE registers. On modern processors only one instruction is needed to unpack 1 bit to 2: just carry-less-multiply the source quadword by itself with the PCLMULQDQ instruction. This interleaves the source bits with zeros. The same trick may help to unpack 2 bits to 4, and unpacking 4 and 8 bits may be done with shuffles, unpacks and simple logical operations.
Average performance seems to be good, but the price is 120 bytes for additional counters and quite a lot of assembly code.
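To see what the PCLMULQDQ trick computes without needing the instruction, here is a portable shift-and-mask version of the same interleave for the low 32 bits (carry-less squaring in GF(2) maps bit i to bit 2i, which is exactly "source bits interleaved with zeros"):

```cpp
#include <cstdint>

// Spread the low 32 bits of x so that bit i lands at bit 2*i, with zeros
// in between. PCLMULQDQ(x, x) produces this for a full 64-bit operand.
uint64_t interleave_zeros(uint64_t x) {
    x &= 0xFFFFFFFFULL;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}
```

Each doubling step here is what the answer proposes to replace with a single carry-less multiply.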
Answer 7:
There's no way to answer this in general; it all depends on the compiler and the underlying architecture. The only real way to know is to try different solutions, and measure. (On some machines, for example, shifts can be very expensive. On others, no.) For starters, I'd use something like:
uint64_t mask = 1;
int index = 0;
while ( mask != 0 ) {
    if ( (bits & mask) != 0 ) {
        ++ bit_counter[index];
    }
    ++ index;
    mask <<= 1;
}
Unrolling the loop completely will likely improve performance.
Depending on the architecture, replacing the if
with:
bit_counter[index] += ((bits & mask) != 0);
might be better. Or worse... it's impossible to know in advance. It's also possible that on some machines, systematically shifting into the low order bit and masking, as you are doing, would be best.
Some optimizations will also depend on what typical data looks like. If most of the words only have one or two bits set, you might gain by testing a byte at a time, or four bits at a time, and skipping those that are all zeros completely.
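A sketch of that skipping idea, assuming GCC/Clang's __builtin_ctz (the byte-skipping structure and names are mine, not part of the original answer):

```cpp
#include <cstdint>

static uint32_t bit_counter[64];

// Visit only the set bits, one byte at a time, skipping all-zero bytes entirely.
void count_sparse(uint64_t bits) {
    for (int base = 0; bits != 0; base += 8, bits >>= 8) {
        unsigned byte = bits & 0xFF;
        while (byte) {
            int b = __builtin_ctz(byte);   // index of lowest set bit in this byte
            ++bit_counter[base + b];
            byte &= byte - 1;              // clear that bit
        }
    }
}
```

For near-random words (half ones, as the question states) this does about 32 increments per call, so it only wins on sparse data.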
Answer 8:
If you count how often each nibble (16 possibilities) occurs at each offset (16 possibilities), you can easily sum the results. And those 256 sums are easily kept:
unsigned long nibble_count[16][16]; // E.g. bits == 0x000700B0 increments [4][7] and [1][0xB]
unsigned long bitcount[64];

void CountNibbles(uint64_t bits) {
    // Count nibbles
    for (int i = 0; i != 16; ++i) {
        nibble_count[i][bits & 0xf]++;
        bits >>= 4;
    }
}

void SumNibbles() {
    for (int i = 0; i != 16; ++i) {
        for (int nibble = 0; nibble != 16; ++nibble) {
            for (int bitpos = 0; bitpos != 4; ++bitpos) {  // all 4 bits of the nibble
                if (nibble & (1 << bitpos)) {
                    bitcount[i*4 + bitpos] += nibble_count[i][nibble];
                }
            }
        }
    }
}
Answer 9:
This is fairly fast:
void count(uint_fast64_t bits) {
    int i64 = ffs64(bits);      // 1-based position of lowest set bit, 0 if none
    while (i64) {
        bit_counter[i64 - 1]++;
        bits &= bits - 1;       // clear the lowest set bit (avoids UB from a 64-bit shift)
        i64 = ffs64(bits);
    }
}
You need a fast implementation of ffs for 64 bits (e.g. __builtin_ffsll with GCC); on most compilers and CPUs this maps to a single instruction. The loop executes once for each set bit in the word, so bits = 0 will be very fast and bits with all 64 bits set will be the slowest.
I tested this under 64-bit Ubuntu with GCC, and it produces the same output as your original:
void Count(uint64 bits) {
    bit_counter[0] += (bits >> 0) & 1;
    bit_counter[1] += (bits >> 1) & 1;
    // ..
    bit_counter[63] += (bits >> 63) & 1;
}
The speed varies with the number of 1 bits in the 64-bit word.
Source: https://stackoverflow.com/questions/7793997/how-to-quickly-count-bits-into-separate-bins-in-a-series-of-ints-on-sandy-bridge