SSE vector wrapper type performance compared to bare __m128
Question: I found an interesting Gamasutra article about SIMD pitfalls, which states that a wrapper type cannot reach the performance of the "pure" __m128 type. I was skeptical, so I downloaded the project files and built a comparable test case. It turned out (to my surprise) that the wrapper version is significantly slower. To make this concrete, the test cases are the following: In the 1st case Vec4 is a simple alias of the __m128 type with