Fastest way to multiply two vectors of 32bit integers in C++, with SSE

Question

I have two unsigned vectors, both with size 4

vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};

Now I want to multiply these two vectors and get a new one

vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};

What is the SSE operation to use? Is it cross-platform, or available only on specific platforms?

Adding: if my goal were addition instead of multiplication, I could do this very fast:

__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);

Answer 1:

Using the set intrinsics such as _mm_set_epi32 for all elements is inefficient. It's better to use the load intrinsics. See the discussion "Where does the SSE instructions outperform normal instructions" for more on that. If the arrays are 16-byte aligned, you can use either _mm_load_si128 or _mm_loadu_si128 (for aligned memory they have nearly the same efficiency); otherwise use _mm_loadu_si128. Aligned memory is still much more efficient, though. To get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so you can use normal free.
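As a rough illustration of the load/store advice, here is a minimal sketch of the addition case from the question applied to std::vector data. The helper name add_vectors is made up for this example, and the code assumes unsigned is 32 bits wide. Because std::vector does not promise 16-byte alignment, it uses the unaligned variants.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original answer): element-wise sum.
// Assumes unsigned is a 32-bit type.
std::vector<unsigned> add_vectors(const std::vector<unsigned>& v1,
                                  const std::vector<unsigned>& v2) {
    assert(v1.size() == v2.size());
    std::vector<unsigned> out(v1.size());
    std::size_t i = 0;
    // Process four 32-bit lanes per iteration.
    for (; i + 4 <= v1.size(); i += 4) {
        // Unaligned load/store: vector storage may not be 16-byte aligned.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v1.data() + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v2.data() + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data() + i),
                         _mm_add_epi32(a, b));
    }
    // Scalar tail for sizes that are not a multiple of four.
    for (; i < v1.size(); ++i) out[i] = v1[i] + v2[i];
    return out;
}
```

With buffers obtained from _mm_malloc(n, 16), the same loop could use _mm_load_si128 and _mm_store_si128 instead.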

To answer the rest of your question, let's assume you have your two vectors loaded in SSE registers __m128i a and __m128i b.

With SSE4.1 or later, use:

_mm_mullo_epi32(a, b);

Without SSE4.1:

This code is copied from Agner Fog's Vector Class Library (and was plagiarized by the original author of this answer):

// Body of Vec4i operator * (Vec4i const & a, Vec4i const & b), SSE2 code path:
__m128i a13    = _mm_shuffle_epi32(a, 0xF5);          // (-,a3,-,a1)
__m128i b13    = _mm_shuffle_epi32(b, 0xF5);          // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                 // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);             // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02,prod13);   // (-,-,a1*b1,a0*b0) 
__m128i prod23 = _mm_unpackhi_epi32(prod02,prod13);   // (-,-,a3*b3,a2*b2) 
__m128i prod   = _mm_unpacklo_epi64(prod01,prod23);   // (ab3,ab2,ab1,ab0)
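The two cases above can be folded into one helper that picks the instruction at compile time. This is only a sketch: the names mul_lo_u32 and multiply4 are invented here, and the __SSE4_1__ check relies on the macro that GCC and Clang predefine when SSE4.1 is enabled (for example with -msse4.1); other compilers may need a different test.

```cpp
#include <emmintrin.h>   // SSE2
#ifdef __SSE4_1__
#include <smmintrin.h>   // SSE4.1, only when enabled at compile time
#endif
#include <cassert>
#include <cstdint>

// Hypothetical helper: low 32 bits of each of the four lane products.
inline __m128i mul_lo_u32(__m128i a, __m128i b) {
#ifdef __SSE4_1__
    return _mm_mullo_epi32(a, b);
#else
    // SSE2 fallback, same steps as the code above.
    __m128i a13    = _mm_shuffle_epi32(a, 0xF5);         // (-,a3,-,a1)
    __m128i b13    = _mm_shuffle_epi32(b, 0xF5);         // (-,b3,-,b1)
    __m128i prod02 = _mm_mul_epu32(a, b);                // (-,a2*b2,-,a0*b0)
    __m128i prod13 = _mm_mul_epu32(a13, b13);            // (-,a3*b3,-,a1*b1)
    __m128i prod01 = _mm_unpacklo_epi32(prod02, prod13); // (-,-,a1*b1,a0*b0)
    __m128i prod23 = _mm_unpackhi_epi32(prod02, prod13); // (-,-,a3*b3,a2*b2)
    return _mm_unpacklo_epi64(prod01, prod23);           // (ab3,ab2,ab1,ab0)
#endif
}

// Hypothetical wrapper: multiplies four uint32_t values from plain arrays.
void multiply4(const std::uint32_t* x, const std::uint32_t* y, std::uint32_t* out) {
    assert(x && y && out);
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), mul_lo_u32(a, b));
}
```

For the inputs from the question, {2, 4, 6, 8} and {1, 10, 11, 13}, multiply4 writes {2, 40, 66, 104}. Products that do not fit in 32 bits are truncated on both paths.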

Answer 2:

There is _mm_mul_epu32, which requires only SSE2 and uses the pmuludq instruction. Since it's an SSE2 instruction, 99.9% of all CPUs support it (I think the most modern CPU that doesn't is an AMD Athlon XP).

It has a significant downside in that it only multiplies two pairs of integers at a time, because it returns 64-bit results and only two of those fit in a register. This means you'll probably need to do a bunch of shuffling, which adds to the cost.
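To make the "two at a time" behaviour concrete, here is a small sketch (the helper name mul_even_lanes is invented for illustration). It shows that _mm_mul_epu32 reads only 32-bit lanes 0 and 2 of each input and produces two full 64-bit products.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: full 64-bit products of lanes 0 and 2 only.
// x and y point to four uint32_t values; out2 receives two uint64_t values.
void mul_even_lanes(const std::uint32_t* x, const std::uint32_t* y,
                    std::uint64_t* out2) {
    assert(x && y && out2);
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
    // Lanes 1 and 3 of a and b are ignored by pmuludq.
    __m128i p = _mm_mul_epu32(a, b);  // (x[2]*y[2], x[0]*y[0]) as 64-bit lanes
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out2), p);
}
```

For {2, 4, 6, 8} and {1, 10, 11, 13} this yields {2, 66}; the products 4*10 and 8*13 need a second multiply on shuffled inputs.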

Answer 3:

Probably _mm_mullo_epi32 is what you need. It is documented for signed integers, but it keeps only the low 32 bits of each product, and those bits are the same for signed and unsigned operands, so it works for unsigned values too. It's SSE 4.1. As an alternative you might want to consider _mm_mul_epu32.
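The signed-versus-unsigned point can be checked in scalar code. The sketch below (helper names are illustrative, not from any library) computes the low 32 bits of a product once with the operands treated as unsigned and once with the same bit patterns treated as signed. It assumes two's-complement conversion, which C++20 guarantees and mainstream compilers have long provided.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of the product when both operands are treated as unsigned.
std::uint32_t low32_unsigned(std::uint32_t x, std::uint32_t y) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(x) * y);
}

// Low 32 bits of the product when the same bit patterns are treated as signed.
std::uint32_t low32_signed(std::uint32_t x, std::uint32_t y) {
    std::int64_t sx = static_cast<std::int32_t>(x);  // reinterpret as signed
    std::int64_t sy = static_cast<std::int32_t>(y);
    return static_cast<std::uint32_t>(sx * sy);      // wraps modulo 2^32
}

// Fails the assertion (in builds without NDEBUG) if the two views disagree.
void check_low_bits_agree(std::uint32_t x, std::uint32_t y) {
    assert(low32_unsigned(x, y) == low32_signed(x, y));
}
```

Calling check_low_bits_agree with values whose top bit is set, such as 0xFFFFFFFF and 0x80000000, is a quick way to probe the case the answer is concerned about.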

Answer 4:

You can (if SSE 4.1 is available) use

__m128i _mm_mullo_epi32 (__m128i a, __m128i b);

to multiply packed 32-bit integers. Otherwise you'd have to shuffle both packs in order to use _mm_mul_epu32 twice. See @user2088790's answer for explicit code.

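Note that you could also use _mm_mul_epi32, but that is also SSE4.1 and, like _mm_mul_epu32, it multiplies only two pairs of elements (signed, with 64-bit results), so you'd rather use _mm_mullo_epi32 anyway.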

Answer 5:

std::transform applies the given function to a range (here, to pairs of elements from two ranges) and stores the result in another range:

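// Needs <algorithm>, <functional> and <vector>.
std::vector<unsigned> result(v1.size());  // the output range must already have room

// Start at begin(), not begin()+1, so that the first element is included.
std::transform(v1.begin(), v1.end(), v2.begin(), result.begin(), std::multiplies<unsigned>());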

Source: https://stackoverflow.com/questions/17264399/fastest-way-to-multiply-two-vectors-of-32bit-integers-in-c-with-sse

Tags

c++

x86

sse

simd

intrinsics