Question
I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4-float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element-wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running on an i7-8565U processor (Whiskey Lake) and compiling with: g++ main.cpp -mavx -o test.exe
to enable the AVX extension instructions in GCC.
The weird thing is that the SIMD version takes about 1.4 seconds, while the non-SIMD version takes only 1 second. I feel as though I'm doing something wrong, as I thought the SIMD version should run about 4 times faster. Any help is appreciated; the code is below. I've placed the non-SIMD code in comments; the code in its current form is the SIMD version.
#include "immintrin.h" // for AVX
#include <iostream>
struct NonSIMDVec {
    float x, y, z, w;
};

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b);

int main() {
    union { __m128 result; float res[4]; };
    // union { NonSIMDVec result; float res[4]; };

    float total = 0;
    for(unsigned i = 0; i < 100000000; ++i) {
        __m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
        __m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
        // NonSIMDVec a4 = {0.0000002f, 1.23f, 2.0f, (float)i};
        // NonSIMDVec b4 = {(float)i, 1.3f, 2.0f, 0.000001f};

        result = _mm_mul_ps(a4, b4);
        // result = multiplyTwo(a4, b4);

        total += res[0];
        total += res[1];
        total += res[2];
        total += res[3];
    }
    std::cout << total << '\n';
}

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b)
{ return {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w}; }
Answer 1:
With optimization disabled (the gcc default is -O0), intrinsics are often terrible. Anti-optimized -O0 code-gen for intrinsics usually hurts a lot (even more than for scalar), and some of the function-like intrinsics introduce extra store/reload overhead. Plus, the extra store-forwarding latency of -O0 tends to hurt more because there's less ILP when you do things with 1 vector instead of 4 scalars.
Use gcc -march=native -O3
But even with optimization enabled, your code is still written to destroy the performance of SIMD by doing a horizontal add of each vector inside the loop. See How to Calculate Vector Dot Product Using SSE Intrinsic Functions in C for how to avoid that: use _mm_add_ps to accumulate a __m128 total vector, and only horizontal-sum it outside the loop.
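A rough sketch of that structure, assuming the same loop body as the question (hsum_ps is an illustrative helper name I'm introducing here, not something from the question or this answer):

#include <immintrin.h>

// Horizontal sum of one __m128 (SSE1-only shuffles); called once, outside the hot loop.
static float hsum_ps(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [v1, v0, v3, v2]
    __m128 sums = _mm_add_ps(v, shuf);                           // [v0+v1, *, v2+v3, *]
    shuf = _mm_movehl_ps(shuf, sums);                            // bring v2+v3 down to element 0
    sums = _mm_add_ss(sums, shuf);                               // v0+v1+v2+v3 in element 0
    return _mm_cvtss_f32(sums);
}

// Loop restructured so the per-iteration work stays vertical:
//   __m128 vtotal = _mm_setzero_ps();
//   for (unsigned i = 0; i < 100000000; ++i) {
//       __m128 a4 = ..., b4 = ...;                        // however the inputs are produced
//       vtotal = _mm_add_ps(vtotal, _mm_mul_ps(a4, b4));  // one vertical add per iteration
//   }
//   float total = hsum_ps(vtotal);                        // horizontal work only once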
You bottleneck your loop on FP-add latency by doing scalar total += inside the loop. That loop-carried dependency chain means your loop can't run any faster than 1 float per 4 cycles on your Skylake-derived microarchitecture, where addss latency is 4 cycles. (https://agner.org/optimize/)
Even better than a single __m128 total, use 4 or 8 accumulator vectors to hide FP-add latency, so your SIMD loop can bottleneck on mul/add (or FMA) throughput instead of latency.
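One way that might look, as a sketch rather than the asker's benchmark: here the data comes from hypothetical float arrays a and b whose length n I assume is a multiple of 16.

#include <immintrin.h>
#include <cstddef>

// Four independent accumulators give four separate addps dependency chains,
// so several adds can be in flight at once and the loop can approach
// mul/add throughput instead of being serialized on add latency.
float sum_products(const float* a, const float* b, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }
    // Combine the chains, then do the horizontal sum once (same idea as hsum_ps above).
    __m128 v = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

The point is simply that each acc register forms its own loop-carried chain; the chains only get combined after the loop.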
Once you fix that, then as @harold points out, the way you're using _mm_set_ps inside the loop will result in pretty bad asm from the compiler. It's not a good choice inside a loop when the operands aren't constants, or at least loop-invariant.
Your example here is clearly artificial; normally you'd be loading SIMD vectors from memory. But if you did need to update a loop counter in a __m128 vector, you might use tmp = _mm_add_ps(tmp, _mm_set_ps(1.0, 0, 0, 0)). Or unroll with adding 1.0, 2.0, 3.0, and 4.0 so the loop-carried dependency is only the += 4.0 in the one element.

x + 0.0 is the identity operation even for FP (except maybe for signed zero), so you can do it to the other elements without changing them.

Or for the low element of a vector, you can use _mm_add_ss (scalar) to modify only it, as in the sketch below.
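For instance, a minimal sketch of that counter-in-a-vector idea, with illustrative names of my own rather than code from the answer:

#include <immintrin.h>

// Keep (float)i in a vector register instead of rebuilding vectors with _mm_set_ps every
// iteration. With _mm_add_ps(counter, _mm_set_ss(1.0f)) the upper elements would get x + 0.0f,
// which leaves them unchanged (modulo signed zero); _mm_add_ss touches only element 0.
void counter_demo() {
    __m128 counter = _mm_setzero_ps();        // element 0 plays the role of (float)i
    const __m128 one = _mm_set_ss(1.0f);      // {1.0f, 0, 0, 0}
    for (unsigned i = 0; i < 100000000; ++i) {
        // ... use `counter` wherever a vector containing (float)i is needed ...
        counter = _mm_add_ss(counter, one);   // loop-carried dep is just one addss
    }
    (void)counter;                            // demo only; result is otherwise unused
}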
Source: https://stackoverflow.com/questions/58365789/why-does-this-simple-c-simd-benchmark-run-slower-when-simd-instructions-are-us