Integer dot product using SSE/AVX?

刺人心 2021-01-06 15:32

I am looking at the Intel Intrinsics Guide:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

and whilst they have _mm_dp_ps and

1 answer
  • 2021-01-06 16:18

    Every time someone does this:

    temp_1 = _mm_set_epi32(x[j], x[j+1], x[j+2], x[j+3]);
    

    ... a puppy dies.

    Use one of these:

    temp_1 = _mm_load_si128(x);  // if aligned
    temp_1 = _mm_loadu_si128(x); // if not aligned
    

    Cast x as necessary.
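To make the cast concrete, here is a minimal sketch assuming x points to 32-bit integers; the helper names load4_i32 and load4_i32_aligned are made up for illustration. Besides gathering four separate scalars, _mm_set_epi32 places its first argument in the highest lane, so it is not a drop-in equivalent of a load, which places x[0] in the lowest lane.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128 */
#include <stdint.h>

/* Hypothetical helper: load p[0..3] with no alignment requirement.
   p[0] ends up in the lowest 32-bit lane. */
static inline __m128i load4_i32(const int32_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Hypothetical helper: same load, but p must be 16-byte aligned. */
static inline __m128i load4_i32_aligned(const int32_t *p)
{
    return _mm_load_si128((const __m128i *)p);
}
```

For a dot product the lane order does not matter as long as both operands use the same one, but it does matter as soon as the vector is stored back to memory.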

    There is no integer version of _mm_dp_ps. But you can do what you were about to do: multiply the integers four at a time and accumulate the sums of the products.

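    So something like this (an untested sketch; it assumes x and y are 16-byte-aligned arrays of 32-bit integers and that num_elements_in_array is a multiple of 4):

    __m128i temp_1, temp_2, temp_products;
    __m128i temp_sum = _mm_setzero_si128(); // four running partial sums, one per lane
    size_t j = 0;
    while (j < num_elements_in_array) {
        // Load the 4 values from x
        temp_1 = _mm_load_si128((const __m128i *)(x + j));
        // Load the 4 values from y
        temp_2 = _mm_load_si128((const __m128i *)(y + j));
        j += 4;
        // Multiply x[0] and y[0], x[1] and y[1] etc. (keeps the low 32 bits of each product; needs SSE4.1)
        temp_products = _mm_mullo_epi32(temp_1, temp_2);
        // Add the products lane by lane into temp_sum
        temp_sum = _mm_add_epi32(temp_sum, temp_products);
    }
    // Horizontal sum of temp_sum: fold the upper two lanes onto the lower two, then lane 1 onto lane 0
    temp_sum = _mm_add_epi32(temp_sum, _mm_srli_si128(temp_sum, 8));
    temp_sum = _mm_add_epi32(temp_sum, _mm_srli_si128(temp_sum, 4));
    int sum = _mm_cvtsi128_si32(temp_sum);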
    

    As discussed in the comments and chat, that reorders the additions so that almost all of them are done vertically (lane by lane) and only a single horizontal sum is needed at the end.
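The loop and final horizontal sum can be packaged as a self-contained function. The following is a sketch under stated assumptions, not a tuned implementation: the name dot_i32_sse41 and the DOT_TARGET_SSE41 macro are made up for illustration, unaligned loads are used so that no alignment is required, a scalar tail handles lengths that are not a multiple of 4, and the result is only the low 32 bits of the true dot product (it wraps on overflow). The target attribute is a GCC/Clang feature that lets the function use SSE4.1 without a global -msse4.1 flag; the CPU must still support SSE4.1 at run time.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 (also pulls in SSE2) */
#include <stddef.h>
#include <stdint.h>

/* Assumption: on GCC/Clang, enable SSE4.1 for this one function only. */
#if defined(__GNUC__) || defined(__clang__)
#define DOT_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define DOT_TARGET_SSE41
#endif

/* Hypothetical function: 32-bit integer dot product of x[0..n) and y[0..n).
   Returns the low 32 bits of the sum of products. */
DOT_TARGET_SSE41
static int32_t dot_i32_sse41(const int32_t *x, const int32_t *y, size_t n)
{
    __m128i acc = _mm_setzero_si128();  /* four partial sums, one per lane */
    size_t j = 0;

    /* Vertical phase: multiply and accumulate 4 elements per iteration. */
    for (; j + 4 <= n; j += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + j));
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + j));
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(vx, vy));
    }

    /* Horizontal phase: lanes {0,1} += lanes {2,3}, then lane 0 += lane 1. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(acc);

    /* Scalar tail for the remaining 0..3 elements; unsigned arithmetic
       keeps the wraparound well defined. */
    for (; j < n; ++j)
        sum += (uint32_t)x[j] * (uint32_t)y[j];

    return (int32_t)sum;
}
```

Calling dot_i32_sse41(x, y, n) with two int32_t arrays of length n returns the same value as a plain scalar loop using 32-bit wrapping arithmetic. If the products can exceed 32 bits, a widening multiply and 64-bit accumulators would be needed instead, which this sketch does not attempt.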
