Count bits 1 on an integer as fast as GCC __builtin_popcount(int)

Posted by 独自空忆成欢 on 2020-06-14 07:38:24

Question


I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

int countBit1Fast(int n)
{
    int c = 0;
    for (; n; ++c)
        n &= n - 1;
    return c;
}

But a friend told me that __builtin_popcount(int) is a lot faster, though less portable. I gave it a try and it was MANY times faster! Why is it so fast? I want to count bits as fast as possible, but without being tied to a particular compiler.

EDIT: I may use it on PIC microcontrollers and maybe on non-Intel processors, so I need maximum portability.


Answer 1:


"I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast"

I don't see why anyone would characterize your approach as "very fast". It's a bit clever, and it should be faster on average than naive alternatives. It also does not depend on the width of the representation of int, which is a plus. I observe that it has undefined behavior for negative arguments (on a two's complement machine the loop eventually reduces n to INT_MIN, and INT_MIN - 1 overflows), but that's a common theme for bitwise operators and functions.

Let's analyze, supposing a non-negative argument:

int c = 0;
for (; n; ++c)
    n &= n - 1;
  • How many loop iterations are performed?

    1 for each 1 bit in the binary representation of the value, irrespective of where in the value each bit lies

  • How much work is performed per iteration?

    • one increment of c
    • one comparison of n against zero (plus one more of these when breaking out of the loop)
    • one decrement of n by 1
    • one bitwise 'and'

    That ignores reads and stores, which very likely can be made free or especially cheap by keeping the operands in registers. If we assume equal cost for each of those, that's four operations per iteration. For random 32-bit integers, there will be an average of 16 iterations, for a total of 65 operations on average (4 × 16, plus the final comparison that ends the loop). The best case is just one operation, but the worst is 129 (4 × 32 + 1), which is no better than a naive implementation.
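The undefined behavior for negative arguments can be sidestepped by running the same loop on an unsigned value. The following is only a minimal sketch of that idea; the name count_bits_kernighan is made up for illustration and does not come from the original posts.

```c
/* Hypothetical helper name; same loop as in the question, on an unsigned value. */
unsigned count_bits_kernighan(unsigned n)
{
    unsigned c = 0;
    /* n & (n - 1) clears the lowest set bit, so the body runs once per 1 bit.
       Unsigned arithmetic cannot overflow, so every input is well defined. */
    for (; n; ++c)
        n &= n - 1;
    return c;
}
```

The operation count discussed above is unchanged: one pass through the loop per set bit, plus the final comparison.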

__builtin_popcount(), on the other hand, uses a single instruction regardless of input on platforms that support it, as yours very likely does. Even on those that don't have a for-purpose instruction, however, it can be done faster (on average).
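One way to get the builtin's speed where it exists without being tied to one compiler is to select it with the preprocessor and fall back to the portable loop elsewhere. This is a sketch under assumptions: popcount_portable is a hypothetical name, and the check relies on GCC and Clang defining __GNUC__.

```c
/* Hypothetical wrapper: use the GCC/Clang builtin when it is available,
   otherwise fall back to the portable loop from the question. */
unsigned popcount_portable(unsigned n)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(n);
#else
    unsigned c = 0;
    for (; n; ++c)      /* one iteration per set bit */
        n &= n - 1;
    return c;
#endif
}
```

Whether the builtin actually becomes a single instruction depends on the target options: on x86, GCC typically needs something like -mpopcnt or -march=native, and otherwise it emits a call to a library routine.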

@dbush has presented one such mechanism that has similar advantages to the one you present. In particular, it does not depend on a pre-chosen integer width, and although it does depend on where in the representation the 1 bits reside, it does run faster for some arguments (smaller ones) than others. If I'm counting right, that one will average around 20 operations on random 32-bit inputs: five in each of four loop iterations (only 0.4% of random inputs would require fewer than four iterations). I'm counting one table read per iteration there, which I assume can be served from cache, but which is probably still not as fast as an arithmetic operation on values already held in registers.

One that is strictly computational would be:

#include <stdint.h>

int countBit1Fast(uint32_t n) {
    /* Each step adds neighbouring fields in parallel, doubling the
       field width each time: 1, 2, 4, 8, then 16 bits. */
    n = (n & 0x55555555u) + ((n >> 1) & 0x55555555u);
    n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
    n = (n & 0x0f0f0f0fu) + ((n >> 4) & 0x0f0f0f0fu);
    n = (n & 0x00ff00ffu) + ((n >> 8) & 0x00ff00ffu);
    n = (n & 0x0000ffffu) + ((n >> 16) & 0x0000ffffu);
    return (int)n;
}

That's pretty easy to count: five additions, five shifts, ten bitwise 'and' operations, and five loads of constants, for a total of 25 operations for every input (and it goes up only to 30 for 64-bit inputs, though those are now 64-bit operations instead of 32-bit ones). This version is, however, intrinsically dependent on a particular size of the input data type.
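For 64-bit inputs, the same scheme needs wider masks and one extra step, which is where the count of 30 operations comes from. A sketch, assuming uint64_t is available; the function name count_bits64 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical 64-bit counterpart of the function above: six steps, each
   with one addition, one shift, two 'and' operations and one constant. */
int count_bits64(uint64_t n)
{
    n = (n & UINT64_C(0x5555555555555555)) + ((n >> 1)  & UINT64_C(0x5555555555555555));
    n = (n & UINT64_C(0x3333333333333333)) + ((n >> 2)  & UINT64_C(0x3333333333333333));
    n = (n & UINT64_C(0x0f0f0f0f0f0f0f0f)) + ((n >> 4)  & UINT64_C(0x0f0f0f0f0f0f0f0f));
    n = (n & UINT64_C(0x00ff00ff00ff00ff)) + ((n >> 8)  & UINT64_C(0x00ff00ff00ff00ff));
    n = (n & UINT64_C(0x0000ffff0000ffff)) + ((n >> 16) & UINT64_C(0x0000ffff0000ffff));
    n = (n & UINT64_C(0x00000000ffffffff)) + ((n >> 32) & UINT64_C(0x00000000ffffffff));
    return (int)n;
}
```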




Answer 2:


As others have mentioned, __builtin_popcount() is fast because it uses a single x86 instruction.

If you want something faster than what you have that doesn't use anything processor- or compiler-specific, you can create a lookup table with 256 entries:

int bitcount[] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};

Then use that to get the bit count of each byte:

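#include <stddef.h>

int countBit1Fast(int n)
{
    size_t i;
    int count = 0;
    /* Point at the bytes of n itself: &n, not n cast to a pointer. */
    unsigned char *ptr = (unsigned char *)&n;
    /* The 256-entry table assumes 8-bit bytes. */
    for (i = 0; i < sizeof(int); i++) {
        count += bitcount[ptr[i]];
    }
    return count;
}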
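A variant of the table approach extracts each byte with shifts instead of a pointer cast, and fills the table at run time instead of writing out 256 literals. This is only a sketch: the names bit_table, init_bit_table and count_bits_table are invented for illustration.

```c
/* Hypothetical names throughout; the table is filled once at start-up. */
static unsigned char bit_table[256];

/* The count for i is its lowest bit plus the count for i >> 1,
   which has already been computed because i >> 1 < i. */
void init_bit_table(void)
{
    int i;
    bit_table[0] = 0;
    for (i = 1; i < 256; i++)
        bit_table[i] = (unsigned char)((i & 1) + bit_table[i >> 1]);
}

/* Look up the low byte, then shift it out; stops once no set bits remain. */
unsigned count_bits_table(unsigned n)
{
    unsigned count = 0;
    while (n) {
        count += bit_table[n & 0xFFu];
        n >>= 8;
    }
    return count;
}
```

init_bit_table() must be called once before the first lookup. Because the loop stops when n reaches zero, small values need fewer iterations than large ones. On a small microcontroller the 256 bytes of table may be significant, so it is worth checking whether a constant table placed in program memory fits better there.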



Answer 3:


__builtin_popcount(unsigned int) is so fast because it is a GCC extension that uses a built-in hardware instruction. If you are willing to trade architecture portability for compiler portability, look into the just-as-fast Intel intrinsic functions, specifically:

_mm_popcnt_u32(unsigned __int32);
_mm_popcnt_u64(unsigned __int64);

You must then include the <nmmintrin.h> header file to use these intrinsic functions; they will, however, work with non-GCC compilers. You may also have to supply a target architecture to get the functions to inline (which is strictly required), using something like -march=native.
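A sketch of how the intrinsic might be wrapped so the same source still builds when the instruction is not enabled. It assumes the compiler defines __POPCNT__ when popcnt support is switched on, as GCC and Clang do with -mpopcnt or a suitable -march; other compilers may need a different check. The name popcount_x86 is hypothetical.

```c
#if defined(__POPCNT__)
#include <nmmintrin.h>   /* declares _mm_popcnt_u32 */
#endif

/* Hypothetical wrapper: use the Intel intrinsic when the target enables it,
   otherwise fall back to the portable loop from the question. */
unsigned popcount_x86(unsigned n)
{
#if defined(__POPCNT__)
    return (unsigned)_mm_popcnt_u32(n);
#else
    unsigned c = 0;
    for (; n; ++c)      /* one iteration per set bit */
        n &= n - 1;
    return c;
#endif
}
```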



Source: https://stackoverflow.com/questions/51387998/count-bits-1-on-an-integer-as-fast-as-gcc-builtin-popcountint
