Question
I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4-float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element-wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running on an i7-8565U processor (Whiskey Lake) and compiling with: g++ main.cpp -mavx -o test.exe
to enable the AVX extension instructions in GCC.
The weird thing is that the SIMD version takes about 1.4 seconds, while the non-SIMD version takes only 1 second. I feel as though I'm doing something wrong, as I thought the SIMD version should run about 4 times faster. Any help is appreciated; the code is below. I've placed the non-SIMD code in comments; the code in its current form is the SIMD version.
#include "immintrin.h" // for AVX
#include <iostream>
struct NonSIMDVec {
    float x, y, z, w;
};

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b);

int main() {
    union { __m128 result; float res[4]; };
    // union { NonSIMDVec result; float res[4]; };

    float total = 0;
    for(unsigned i = 0; i < 100000000; ++i) {
        __m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
        __m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
        // NonSIMDVec a4 = {0.0000002f, 1.23f, 2.0f, (float)i};
        // NonSIMDVec b4 = {(float)i, 1.3f, 2.0f, 0.000001f};

        result = _mm_mul_ps(a4, b4);
        // result = multiplyTwo(a4, b4);

        total += res[0];
        total += res[1];
        total += res[2];
        total += res[3];
    }
    std::cout << total << '\n';
}

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b)
{ return {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w}; }
Answer 1:
With optimization disabled (the gcc default is -O0), intrinsics are often terrible. Anti-optimized -O0 code-gen for intrinsics usually hurts a lot (even more than for scalar), and some of the function-like intrinsics introduce extra store/reload overhead. Plus, the extra store-forwarding latency of -O0 tends to hurt more because there's less ILP when you do things with 1 vector instead of 4 scalars.
Use gcc -march=native -O3
But even with optimization enabled, your code is still written to destroy the performance of SIMD by doing a horizontal add of each vector inside the loop. See How to Calculate Vector Dot Product Using SSE Intrinsic Functions in C for how to avoid that: use _mm_add_ps to accumulate a __m128 total vector, and only horizontal-sum it outside the loop.
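A rough sketch of that structure, assuming the same loop body as the question (hsum_ps is an illustrative helper name I'm introducing here, not something from the question or this answer):

#include <immintrin.h>

// Horizontal sum of one __m128 (SSE1-only shuffles); called once, outside the hot loop.
static float hsum_ps(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [v1, v0, v3, v2]
    __m128 sums = _mm_add_ps(v, shuf);                           // [v0+v1, *, v2+v3, *]
    shuf = _mm_movehl_ps(shuf, sums);                            // bring v2+v3 down to element 0
    sums = _mm_add_ss(sums, shuf);                               // v0+v1+v2+v3 in element 0
    return _mm_cvtss_f32(sums);
}

// Loop restructured so the per-iteration work stays vertical:
//   __m128 vtotal = _mm_setzero_ps();
//   for (unsigned i = 0; i < 100000000; ++i) {
//       __m128 a4 = ..., b4 = ...;                        // however the inputs are produced
//       vtotal = _mm_add_ps(vtotal, _mm_mul_ps(a4, b4));  // one vertical add per iteration
//   }
//   float total = hsum_ps(vtotal);                        // horizontal work only once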
You bottleneck your loop on FP-add latency by doing scalar total += inside the loop. That loop-carried dependency chain means your loop can't run any faster than 1 float per 4 cycles on your Skylake-derived microarchitecture, where addss latency is 4 cycles. (https://agner.org/optimize/)
Even better than a single __m128 total, use 4 or 8 accumulator vectors to hide FP-add latency, so your SIMD loop can bottleneck on mul/add (or FMA) throughput instead of latency.
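One way that might look, as a sketch rather than the asker's benchmark: here the data comes from hypothetical float arrays a and b whose length n I assume is a multiple of 16.

#include <immintrin.h>
#include <cstddef>

// Four independent accumulators give four separate addps dependency chains,
// so several adds can be in flight at once and the loop can approach
// mul/add throughput instead of being serialized on add latency.
float sum_products(const float* a, const float* b, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }
    // Combine the chains, then do the horizontal sum once (same idea as hsum_ps above).
    __m128 v = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

The point is simply that each acc register forms its own loop-carried chain; the chains only get combined after the loop.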
Once you fix that, then as @harold points out, the way you're using _mm_set_ps inside the loop will result in pretty bad asm from the compiler. It's not a good choice inside a loop when the operands aren't constants, or at least loop-invariant.
Your example here is clearly artificial; normally you'd be loading SIMD vectors from memory. But if you did need to update a loop counter in a __m128 vector, you might use tmp = _mm_add_ps(tmp, _mm_set_ps(1.0, 0, 0, 0)). Or unroll with adding 1.0, 2.0, 3.0, and 4.0 so the loop-carried dependency is only the += 4.0 in the one element.

x + 0.0 is the identity operation even for FP (except maybe for signed zero), so you can do it to the other elements without changing them.

Or for the low element of a vector, you can use _mm_add_ss (scalar) to modify only it, as in the sketch below.
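For instance, a minimal sketch of that counter-in-a-vector idea, with illustrative names of my own rather than code from the answer:

#include <immintrin.h>

// Keep (float)i in a vector register instead of rebuilding vectors with _mm_set_ps every
// iteration. With _mm_add_ps(counter, _mm_set_ss(1.0f)) the upper elements would get x + 0.0f,
// which leaves them unchanged (modulo signed zero); _mm_add_ss touches only element 0.
void counter_demo() {
    __m128 counter = _mm_setzero_ps();        // element 0 plays the role of (float)i
    const __m128 one = _mm_set_ss(1.0f);      // {1.0f, 0, 0, 0}
    for (unsigned i = 0; i < 100000000; ++i) {
        // ... use `counter` wherever a vector containing (float)i is needed ...
        counter = _mm_add_ss(counter, one);   // loop-carried dep is just one addss
    }
    (void)counter;                            // demo only; result is otherwise unused
}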
Source: https://stackoverflow.com/questions/58365789/why-does-this-simple-c-simd-benchmark-run-slower-when-simd-instructions-are-us