Writing a portable SSE/AVX version of std::copysign

春和景丽 2021-01-12 10:51

I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a

2 Answers
  •  Happy的楠姐
    2021-01-12 11:29

    AVX versions for float and double (the sign bit is taken from `from`, everything else from `to`):

    #include <immintrin.h>
    
    __m256 copysign_ps(__m256 from, __m256 to) {
        constexpr float signbit = -0.f;
        auto const avx_signbit = _mm256_broadcast_ss(&signbit);
        return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
    }
    
    __m256d copysign_pd(__m256d from, __m256d to) {
        constexpr double signbit = -0.;
        auto const avx_signbit = _mm256_broadcast_sd(&signbit);
        return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
    }
    

    See also: assembly; The Intel Intrinsics Guide.
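Since the question also asks about SSE, the same and/andnot/or pattern can be written with 128-bit intrinsics. The following is only a sketch, not part of the original answer: the names `copysign_ps_sse`, `copysign_pd_sse` and `sse_copysign_matches_std` are made up for illustration, and the argument order follows the AVX versions above (sign from `from`, magnitude from `to`), which is the reverse of `std::copysign(magnitude, sign)`.

```cpp
#include <cmath>
#include <emmintrin.h>  // SSE2; also provides the SSE (__m128) intrinsics

// Hypothetical SSE counterpart of copysign_ps: sign bit from `from`,
// exponent and mantissa from `to`.
inline __m128 copysign_ps_sse(__m128 from, __m128 to) {
    const __m128 signbit = _mm_set1_ps(-0.0f);  // only the sign bit set per lane
    // (signbit & from) | (~signbit & to)
    return _mm_or_ps(_mm_and_ps(signbit, from), _mm_andnot_ps(signbit, to));
}

// Hypothetical SSE2 counterpart of copysign_pd.
inline __m128d copysign_pd_sse(__m128d from, __m128d to) {
    const __m128d signbit = _mm_set1_pd(-0.0);
    return _mm_or_pd(_mm_and_pd(signbit, from), _mm_andnot_pd(signbit, to));
}

// Illustrative self-check: compares every lane with std::copysign.
inline bool sse_copysign_matches_std() {
    const float sign_f[4] = {-1.0f, 2.0f, -0.0f, 0.0f};
    const float mag_f[4]  = {3.0f, -4.0f, 5.0f, -6.0f};
    float out_f[4];
    _mm_storeu_ps(out_f, copysign_ps_sse(_mm_loadu_ps(sign_f), _mm_loadu_ps(mag_f)));
    for (int i = 0; i < 4; ++i)
        if (out_f[i] != std::copysign(mag_f[i], sign_f[i])) return false;

    const double sign_d[2] = {-1.0, 0.0};
    const double mag_d[2]  = {2.5, -7.0};
    double out_d[2];
    _mm_storeu_pd(out_d, copysign_pd_sse(_mm_loadu_pd(sign_d), _mm_loadu_pd(mag_d)));
    for (int i = 0; i < 2; ++i)
        if (out_d[i] != std::copysign(mag_d[i], sign_d[i])) return false;
    return true;
}
```

Calling `sse_copysign_matches_std()` from a test is one way to confirm the argument order before wiring the functions into a larger kernel.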


    With AVX2, avx_signbit can be generated without loading a constant:

    __m256 copysign2_ps(__m256 from, __m256 to) {
        auto a = _mm256_castps_si256(from);
        auto avx_signbit = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cmpeq_epi32(a, a), 31));
        return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
    }
    
    __m256d copysign2_pd(__m256d from, __m256d to) {
        auto a = _mm256_castpd_si256(from);
        auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
        return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
    }
    

    Even so, both clang and gcc compute avx_signbit at compile time and replace it with a constant loaded from the .rodata section, which is, IMO, sub-optimal.
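The all-ones-then-shift construction used in the AVX2 versions above can also be expressed with SSE2 integer intrinsics. This is a minimal sketch under the assumption that only SSE2 is available; the helper names (`signmask_ps_sse2`, `signmask_pd_sse2`, `first_lane_bits_ps`, `first_lane_bits_pd`) are hypothetical. A compiler may well fold it into a constant load, just as described for the AVX2 versions.

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2

// Build the float sign mask without a memory constant:
// comparing a register with itself yields all ones, then shift left by 31.
inline __m128 signmask_ps_sse2() {
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_cmpeq_epi32(zero, zero);   // 0xFFFFFFFF per lane
    return _mm_castsi128_ps(_mm_slli_epi32(ones, 31));  // 0x80000000 per lane
}

// Same idea for double. _mm_cmpeq_epi64 needs SSE4.1, so the all-ones
// value is produced with the 32-bit compare and shifted as 64-bit lanes.
inline __m128d signmask_pd_sse2() {
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_cmpeq_epi32(zero, zero);
    return _mm_castsi128_pd(_mm_slli_epi64(ones, 63));  // only bit 63 set per lane
}

// Illustrative helpers: raw bit pattern of the lowest lane.
inline std::uint32_t first_lane_bits_ps(__m128 v) {
    const float f = _mm_cvtss_f32(v);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

inline std::uint64_t first_lane_bits_pd(__m128d v) {
    const double d = _mm_cvtsd_f64(v);
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Whether this beats a plain constant load depends on the surrounding code and the compiler, so it is worth inspecting the generated assembly rather than assuming either way.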
