Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

Submitted by 匆匆过客 on 2020-08-20 05:28:22

Question:

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

#include <iostream>
#include <chrono>
#include <cstdint>     // uint64_t
#include <cstdlib>     // atol, rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer; // allocated with new[], so release with delete[]
}

As you can see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the total time. In the first case, the inner loop variable is unsigned; in the second case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB of random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result of test 1:

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is almost the same result and is still strange. But now it gets super strange. I replaced the buffer size that was read from input with a constant 1, so I changed:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even led to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell-specific. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative! We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s version to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

My question

Can you explain these results? Especially:

  • How can there be such a difference between u32 and u64?
  • How can replacing a non-constant by a constant buffer size trigger less optimal code?
  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could again mix up the results completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize:

0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize:

0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and clean. Can anyone explain this?

Lessons learned

No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I had never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage class of the size variable can make a huge difference, as we saw with the insertion of the static keyword in front of it! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.


Solution:

Reference 1: https://stackoom.com/question/1hE0T/将-位循环计数器替换为-位会在Intel-CPU上使用-mm-popcnt-u-引起疯狂的性能偏差
Reference 2: https://oldbug.net/q/1hE0T/Replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviations-with-mm-popcnt-u64-on-Intel-CPUs