How to implement sign function with SSE3?

Submitted by 一世执手 on 2021-02-16 13:08:38

Question


1) Is there a way to efficiently implement a sign function using SSE3 (no SSE4) with the following characteristics?

  • the input is a float vector __m128.
  • the output should also be a __m128 whose elements are -1.0f, 0.0f, or 1.0f.

I tried this, but it didn't work (though I think it should):

__m128 inputVal = _mm_set_ps(-0.5f, 0.5f, 0.0f, 3.0f);
__m128 comp1 = _mm_cmpgt_ps(_mm_setzero_ps(), inputVal);
__m128 comp2 = _mm_cmpgt_ps(inputVal, _mm_setzero_ps());
comp1 = _mm_castsi128_ps(_mm_castps_si128(comp1)); // a cast round-trip: reinterprets bits, no conversion
comp2 = _mm_castsi128_ps(_mm_castps_si128(comp2));
__m128 signVal = _mm_sub_ps(comp1, comp2);

2) Is there a way to create a "flag" function (I'm not sure of the right name)? Namely, the result should be 1 if A > B and 0 otherwise. The result should be floating-point (__m128), just like its inputs.

UPDATE: It seems Cory Nelson's answer will work here:

__m128 greaterThanFlag = _mm_and_ps(_mm_cmpgt_ps(valA, valB), _mm_set1_ps(1.0f));
__m128 lessThanFlag = _mm_and_ps(_mm_cmplt_ps(valA, valB), _mm_set1_ps(1.0f));
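
One illustrative use for such a flag (my addition, not from the original post) is a branchless conditional accumulate, here with hypothetical sum and step vectors:

// add `step` only in the lanes where valA > valB
__m128 step = _mm_set1_ps(0.25f);
sum = _mm_add_ps(sum, _mm_mul_ps(greaterThanFlag, step));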

Answer 1:


The first that comes to mind is perhaps the simplest:

__m128 sign(__m128 x)
{
    __m128 zero = _mm_setzero_ps();

    __m128 positive = _mm_and_ps(_mm_cmpgt_ps(x, zero), _mm_set1_ps(1.0f));
    __m128 negative = _mm_and_ps(_mm_cmplt_ps(x, zero), _mm_set1_ps(-1.0f));

    // at most one of the two masks is set per lane, so OR merges them
    return _mm_or_ps(positive, negative);
}
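
A quick sanity check (my addition, not part of the original answer), assuming the sign() above is in scope:

#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    // _mm_set_ps loads lanes high-to-low: {3.0, 0.0, 0.5, -0.5} from low to high
    __m128 in = _mm_set_ps(-0.5f, 0.5f, 0.0f, 3.0f);

    float r[4];
    _mm_storeu_ps(r, sign(in));
    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]); // expected: 1 0 1 -1
    return 0;
}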

Or, if you misspoke and intended to get an integer result:

__m128i sign(__m128 x)
{
    __m128 zero = _mm_setzero_ps();

    __m128 positive = _mm_and_ps(_mm_cmpgt_ps(x, zero),
                                 _mm_castsi128_ps(_mm_set1_epi32(1)));
    __m128 negative = _mm_cmplt_ps(x, zero); // an all-ones compare mask is already -1 as int32

    return _mm_castps_si128(_mm_or_ps(positive, negative));
}



Answer 2:


If it's ok for sgn(-0.0f) to produce an output of -0.0f instead of +0.0f, you can save an instruction or two compared to @Cory Nelson's version. See below for a version which also propagates NaN.

  • select 0.0 or 1.0 based on a compare for x != 0.0f.
  • copy the sign bit of x onto that result.

// return -0.0 for x=-0.0, otherwise the same as Cory's (except for NaN, which neither handles well)
__m128 sgn_fast(__m128 x)
{
    __m128 negzero = _mm_set1_ps(-0.0f);

    // using _mm_setzero_ps() here might actually be better without AVX, since xor-zeroing is as cheap as a copy but starts a new dependency chain
    //__m128 nonzero = _mm_cmpneq_ps(x, negzero);  // -0.0 == 0.0 in IEEE floating point
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());

    __m128 x_signbit = _mm_and_ps(x, negzero);

    __m128 zeroone = _mm_and_ps(nonzero, _mm_set1_ps(1.0f));
    return _mm_or_ps(zeroone, x_signbit);
}

When the input is NaN, I think it returns +/-1.0f, according to the sign of the NaN (since _mm_cmpneq_ps() is true when x is NaN: see the predicate table in the CMPPD instruction's documentation).

Without AVX, this is two fewer instructions than Cory's version (with clang 3.9 on the Godbolt compiler explorer). When inlined into a loop, the memory source operands could be register source operands. gcc uses more instructions, doing a separate MOVAPS load and painting itself into a corner that requires an extra MOVAPS to get the return value into xmm0.

    xorps   xmm1, xmm1
    cmpneqps        xmm1, xmm0
    andps   xmm0, xmmword ptr [rip + .LCPI0_0]    # x_signbit
    andps   xmm1, xmmword ptr [rip + .LCPI0_1]    # zeroone
    orps    xmm0, xmm1

The critical-path latency is cmpneqps + andps + orps, which is 3+1+1 cycles on Intel Haswell for example. Cory's version needs to run two cmpps instructions in parallel to achieve that latency, which is only possible on Skylake. Other CPUs will have a resource conflict causing an extra cycle of latency.


To propagate NaN, so the possible outputs would be -1.0f, -/+0.0f, 1.0f, and NaN, we could take advantage of the fact that the all-ones bit pattern is a NaN.

  • _mm_cmpunord_ps(x,x) to get a NaN-mask (or equivalently, cmpneqps).
  • OR that onto the result to leave it unmodified or force it to NaN.

// return -0.0 for x=-0.0.  Return -NaN for any NaN
__m128 sgn_fast_nanpropagating(__m128 x)
{
    __m128 negzero = _mm_set1_ps(-0.0f);
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());

    __m128 x_signbit = _mm_and_ps(x, negzero);
    __m128 nanmask   = _mm_cmpunord_ps(x,x);
    __m128 x_sign_or_nan = _mm_or_ps(x_signbit, nanmask);   // apply it here instead of to the final result for better ILP

    __m128 zeroone = _mm_and_ps(nonzero, _mm_set1_ps(1.0f));
    return _mm_or_ps(zeroone, x_sign_or_nan);
}
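
To see the -0.0f and NaN behavior concretely, a small check (my addition, assuming sgn_fast_nanpropagating() above is in scope) could look like:

#include <math.h>
#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    // lanes from low to high: {7.0, -2.5, -0.0, NaN}
    __m128 in = _mm_set_ps(NAN, -0.0f, -2.5f, 7.0f);

    float r[4];
    _mm_storeu_ps(r, sgn_fast_nanpropagating(in));
    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]); // expected: 1 -1 -0 -nan (NaN formatting varies by libc)
    return 0;
}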

This compiles efficiently, and barely lengthens the critical path latency. It does take more MOVAPS instructions to copy registers without AVX, though.


You might be able to do something useful with SSE4.1 BLENDVPS, but it's not the most efficient instruction on all CPUs. It's also hard to avoid treating negative zero as non-zero.
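
For what that might look like, here is a hedged SSE4.1 sketch (my addition, not from the original answer). BLENDVPS selects by the sign bit of the mask, so x itself can serve as the mask, but fixing up -0.0f still costs a compare and an AND:

#include <smmintrin.h>  // SSE4.1

__m128 sgn_blendv_sse41(__m128 x)
{
    // pick -1.0f where the sign bit of x is set, else +1.0f
    __m128 pm1 = _mm_blendv_ps(_mm_set1_ps(1.0f), _mm_set1_ps(-1.0f), x);
    // mask of lanes where x != 0.0f  (-0.0f compares equal to 0.0f)
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
    return _mm_and_ps(pm1, nonzero);  // zero out the x == +/-0.0f lanes
}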


If you want an integer result, you can use SSSE3 _mm_sign_epi32(set1(1), x) to get a -1, 0, or 1 output. If -0.0f -> -1 is too sloppy, you can fix that up by ANDing with the result of _mm_cmpneq_ps(x, _mm_setzero_ps()).

// returns -1 for x = -0.0f
__m128i sgn_verysloppy_int_ssse3(__m128 x) {
  __m128i one = _mm_set1_epi32(1);
  __m128i sign = _mm_sign_epi32(one, _mm_castps_si128(x));
  return sign;
}

// correct results for all inputs
// NaN -> -1 or 1 according to its sign bit, never 0
__m128i sgn_int_ssse3(__m128 x) {
  __m128i one = _mm_set1_epi32(1);
  __m128i sign = _mm_sign_epi32(one, _mm_castps_si128(x));

  __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
  return _mm_and_si128(sign, _mm_castps_si128(nonzero));
}



Answer 3:


If you need a signum function for float vectors, where the result is an int32_t vector, and you don't care about NaNs, a more efficient version can be implemented using integer instructions, based on the following theory.

If you take a floating-point number and reinterpret its bits as a signed two's complement integer, you get 3 distinct cases (where X is an arbitrary 0 or 1, the leftmost bit is the sign bit, and the trailing 1 in the first pattern stands for "at least one bit besides the sign bit is set"):

  • 0 X X X X X X X X X X X X X X 1, which is > 0 (or > 0.0f as float)
  • 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, which is == 0 (or == 0.0f as float)
  • 1 X X X X X X X X X X X X X X X, which is < 0 (or <= 0.0f as float)

The last case is ambiguous, since it can be the special floating-point case of negative zero -0.0f:

  • 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, which is == -0.0f == 0.0f as float

From this point on, the floating-point signum function reduces to an integer one.


Using intrinsics available with SSE3 (not SSSE3), this can be implemented as:

inline __m128i _mm_signum_ps(__m128 a)
{
    __m128i x = _mm_castps_si128(a);

    __m128i zero = _mm_setzero_si128();
    __m128i m0 = _mm_cmpgt_epi32(x, zero);
    __m128i m1 = _mm_cmplt_epi32(x, zero);
    __m128i m2 = _mm_cmpeq_epi32(x, _mm_set1_epi32(0x80000000));

    __m128i p = _mm_and_si128(m0, _mm_set1_epi32(+1));

    // Since -1 == 0xFFFFFFFF in two's complement, n is identical to m1,
    // so the masking below is redundant and shown only for symmetry:
    // __m128i n = _mm_and_si128(m1, _mm_set1_epi32(-1));
    __m128i n = m1;

    return _mm_andnot_si128(m2, _mm_or_si128(p, n));
}

The optimized version of this is:

inline __m128i _mm_signum_ps(__m128 a)
{
    __m128i x = _mm_castps_si128(a);

    __m128i zr = _mm_setzero_si128();
    __m128i m0 = _mm_cmpeq_epi32(x, _mm_set1_epi32(0x80000000));
    __m128i mp = _mm_cmpgt_epi32(x, zr);
    __m128i mn = _mm_cmplt_epi32(x, zr);

    return _mm_or_si128(
      _mm_andnot_si128(m0, mn),
      _mm_and_si128(mp, _mm_set1_epi32(1))
    );
}

As Peter suggested in the comments, using one floating-point comparison _mm_cmplt_ps instead of the two integer comparisons _mm_cmplt_epi32/_mm_cmpeq_epi32 to handle -0.0f saves one cycle of latency, but it might suffer a bypass-delay latency from switching between the floating-point and integer domains, so it may be better to stick with the integer-only implementation above. Or not: since you need an integer result, you will likely move to the integer domain anyway. So:

inline __m128i _mm_signum_ps(__m128 a)
{
    __m128i x = _mm_castps_si128(a);
    __m128 zerops = _mm_setzero_ps();

    __m128i mn = _mm_castps_si128(_mm_cmplt_ps(a, zerops));
    __m128i mp = _mm_cmpgt_epi32(x, _mm_castps_si128(zerops));

    return _mm_or_si128(mn, _mm_and_si128(mp, _mm_set1_epi32(1)));
}

With -march=x86-64 -msse3 -O3, clang 3.9 compiles this to:

_mm_signum_ps(float __vector(4)):                # @_mm_signum_ps(float __vector(4))
        xorps   xmm1, xmm1                       # fp domain
        movaps  xmm2, xmm0                       # fp domain
        cmpltps xmm2, xmm1                       # fp domain
        pcmpgtd xmm0, xmm1                       # int domain
        psrld   xmm0, 31                         # int domain
        por     xmm0, xmm2                       # int domain
        ret

Except for cmpltps, every instruction here has a latency of 1 and a reciprocal throughput of at most 1. I think it is a really efficient solution, and it can be further improved with SSSE3's _mm_sign_epi32, as sketched below.
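
For completeness, here is a sketch of that SSSE3 variant (my addition; it ends up essentially identical to sgn_int_ssse3 from Answer 2):

#include <tmmintrin.h>  // SSSE3

inline __m128i _mm_signum_ssse3_ps(__m128 a)
{
    // psignd gives -1/0/+1 from the integer sign of a's bit pattern,
    // replacing pcmpgtd + psrld + por with a single instruction...
    __m128i sign = _mm_sign_epi32(_mm_set1_epi32(1), _mm_castps_si128(a));

    // ...but it maps -0.0f to -1, so mask the +/-0.0f lanes back to zero
    __m128 nonzero = _mm_cmpneq_ps(a, _mm_setzero_ps());
    return _mm_and_si128(sign, _mm_castps_si128(nonzero));
}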


If you need floating-point results, it is better to stay entirely in the floating-point domain (instead of swapping between the floating-point and integer domains), so use one of Peter's solutions.




Answer 4:


You're close, but your code doesn't quite work because you are trying to convert a 0/-1 integer mask to float using only a cast, which reinterprets the bits instead of converting the values.

Try this (untested):

__m128 inputVal = _mm_set_ps(-0.5f, 0.5f, 0.0f, 3.0f);
__m128 comp1 = _mm_cmpgt_ps(_mm_setzero_ps(), inputVal);
__m128 comp2 = _mm_cmpgt_ps(inputVal, _mm_setzero_ps());
comp1 = _mm_cvtepi32_ps(_mm_castps_si128(comp1)); // 0/-1 => 0.0f/-1.0f
comp2 = _mm_cvtepi32_ps(_mm_castps_si128(comp2));
__m128 signVal = _mm_sub_ps(comp1, comp2);

Having said that, I think Cory's solution is probably more efficient.



Source: https://stackoverflow.com/questions/41315420/how-to-implement-sign-function-with-sse3
