Add saturate 32-bit signed ints intrinsics?

落爺英雄遲暮 submitted on 2019-12-07 16:31:14

Question


Can someone recommend a fast way to do a saturating add of 32-bit signed integers using Intel intrinsics (AVX, SSE4, ...)?

I looked at the intrinsics guide and found _mm256_adds_epi16 but this seems to only add 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around.


Answer 1:


A signed overflow will happen if (and only if):

  • the signs of both inputs are the same, and
  • the sign of the sum (when added with wrap-around) differs from the sign of the inputs

Using C operators: overflow = ~(a^b) & (a^(a+b)), where only the sign bit of overflow is meaningful.
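As a scalar illustration of that formula, here is a minimal sketch in plain C. The function names add_overflows_i32 and adds_i32_scalar are made up for this example. The sum is computed on uint32_t because signed overflow is undefined behavior in C, whereas unsigned arithmetic wraps.

```c
#include <stdint.h>

/* Returns 1 if a + b overflows int32_t, 0 otherwise.
   The addition is done on uint32_t so the wrap-around is well defined. */
static int add_overflows_i32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                     /* wrapped sum */
    uint32_t mask = ~(ua ^ ub) & (ua ^ sum);    /* only bit 31 is meaningful */
    return (int)(mask >> 31);
}

/* Scalar saturating add built on the same test. */
static int32_t adds_i32_scalar(int32_t a, int32_t b)
{
    if (add_overflows_i32(a, b))
        return a < 0 ? INT32_MIN : INT32_MAX;   /* saturate toward the sign of the inputs */
    return a + b;   /* no overflow here, so the signed addition is safe */
}
```

The SIMD versions apply the same test to every 32-bit lane at once and replace the branch with a blend.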

Also, if an overflow happens, the saturated result has the same sign as the inputs (which share a sign in that case). Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:

#include <smmintrin.h> // SSE4.1, for _mm_blendv_ps

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res      = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit); // int_max, or int_min if a is negative

    // saturation happened if inputs do not have different signs,
    // but sign of result is different:
    __m128i sign_xor  = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    // blendvps takes the second operand where the sign bit of the mask is set,
    // so overflowed lanes get saturated and all other lanes keep res
    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                          _mm_castsi128_ps( saturated ),
                                          _mm_castsi128_ps( overflow ) ) );
}

If your blendvps is as fast as (or faster than) a shift and an addition (also considering port usage), you can of course just blend int_min and int_max using the sign bits of a. Also, if you have only SSE2 or SSE3, you can replace the last blend with an arithmetic shift (of overflow) 31 bits to the right, followed by manual blending (using and/andnot/or).
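The SSE2-only fallback described above could look roughly like the following sketch. The name adds_epi32_sse2 is hypothetical (it is not a real intrinsic), and the code assumes an x86 target where <emmintrin.h> is available.

```c
#include <emmintrin.h>  /* SSE2 only */

/* Hypothetical helper: saturating add of four signed 32-bit lanes without blendvps. */
static __m128i adds_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);

    /* wrapping sum */
    __m128i res = _mm_add_epi32(a, b);

    /* int_max for non-negative a, int_min (= int_max + 1, wrapped) for negative a */
    __m128i saturated = _mm_add_epi32(int_max, _mm_srli_epi32(a, 31));

    /* sign bit set where a and b share a sign that res does not have */
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, res));

    /* arithmetic shift broadcasts the sign bit to all 32 bits of each lane */
    __m128i mask = _mm_srai_epi32(overflow, 31);

    /* manual blend: (mask & saturated) | (~mask & res) */
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}
```

The arithmetic shift is what turns the single meaningful sign bit into a full-width lane mask, which the and/andnot/or sequence needs because it selects bit by bit rather than per lane.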

And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).

Addendum: If you know the sign of either a or b at compile-time, you can directly set saturated accordingly, and you can save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for positive a and _mm_andnot_si128(res, b) for negative a (with res = a+b).
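To see why the simplified masks work, here is the per-lane logic written as scalar C. This is only a sketch: the helper names adds_i32_nonneg_a and adds_i32_neg_a are invented for illustration, and ~ub & res corresponds to _mm_andnot_si128(b, res), since andnot negates its first operand.

```c
#include <stdint.h>

/* Hypothetical helper: saturating a + b when a is known to be >= 0. */
static int32_t adds_i32_nonneg_a(int32_t a, int32_t b)
{
    uint32_t ub  = (uint32_t)b;
    uint32_t res = (uint32_t)a + ub;    /* wrapping sum */
    uint32_t overflow = ~ub & res;      /* sign bit set: b >= 0 but res < 0 */
    if (overflow >> 31)
        return INT32_MAX;               /* only positive overflow is possible */
    return a + b;                       /* no overflow, signed addition is safe */
}

/* Hypothetical helper: saturating a + b when a is known to be < 0. */
static int32_t adds_i32_neg_a(int32_t a, int32_t b)
{
    uint32_t ub  = (uint32_t)b;
    uint32_t res = (uint32_t)a + ub;    /* wrapping sum */
    uint32_t overflow = ~res & ub;      /* sign bit set: b < 0 but res >= 0 */
    if (overflow >> 31)
        return INT32_MIN;               /* only negative overflow is possible */
    return a + b;                       /* no overflow, signed addition is safe */
}
```

With the sign of a fixed, the term ~(a^b) reduces to ~b or b, and a^res reduces to res or ~res, which is where the two xor operations disappear.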




Answer 2:


This link answers this very question:

https://software.intel.com/en-us/forums/topic/285219

Here's an example implementation:

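#include <immintrin.h>
#include <stddef.h>   // size_t
#include <stdint.h>   // int32_t, INT32_MIN, INT32_MAX

static inline __m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_min = _mm_set1_epi32( INT32_MIN );
    const __m128i int_max = _mm_set1_epi32( INT32_MAX );

    __m128i res      = _mm_add_epi32( a, b ); // wraps around on overflow
    __m128i sign_and = _mm_and_si128( a, b );
    __m128i sign_or  = _mm_or_si128( a, b );

    // both inputs negative but result non-negative: saturate to int_min
    __m128i min_sat_mask = _mm_andnot_si128( res, sign_and );
    // both inputs non-negative but result negative: saturate to int_max
    __m128i max_sat_mask = _mm_andnot_si128( sign_or, res );

    __m128 res_temp = _mm_blendv_ps(_mm_castsi128_ps( res ),
                                    _mm_castsi128_ps( int_min ),
                                    _mm_castsi128_ps( min_sat_mask ) );

    return _mm_castps_si128(_mm_blendv_ps( res_temp,
                                          _mm_castsi128_ps( int_max ),
                                          _mm_castsi128_ps( max_sat_mask ) ) );
}

// Adds bufferB to bufferA with saturation and stores the result in bufferA.
// Only full groups of 4 samples are processed; a remainder of
// numSamples % 4 samples is left untouched.
void addSaturate(int32_t* bufferA, int32_t* bufferB, size_t numSamples)
{
    __m128i* pSrc1 = (__m128i*)bufferA;
    __m128i* pSrc2 = (__m128i*)bufferB;

    for(size_t i=0; i<numSamples/4; ++i)
    {
        // unaligned load/store, so the buffers need no 16-byte alignment
        __m128i res = __mm_adds_epi32(_mm_loadu_si128(pSrc1),
                                      _mm_loadu_si128(pSrc2));
        _mm_storeu_si128(pSrc1, res);

        pSrc1++;
        pSrc2++;
    }
}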


Source: https://stackoverflow.com/questions/29498824/add-saturate-32-bit-signed-ints-intrinsics
