Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?
问题 I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4 float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running