Gather AVX2&512 intrinsic for 16-bit integers?

Question

Imagine this piece of code:

void Function(int16_t *src, int *indices, float *dst, int cnt, float mul)
{
    for (int i=0; i<cnt; i++) dst[i] = float(src[indices[i]]) * mul;
}

This really calls for gather intrinsics, e.g. _mm_i32gather_epi32. I have had great success with these when loading floats, but are there any for 16-bit ints? Another problem is that I need to go from 16 bits on the input to 32 bits (float) on the output.

Answer 1:

There is indeed no instruction to gather 16-bit integers, but (assuming there is no risk of a memory-access violation) you can load 32-bit integers starting at the corresponding addresses and discard the upper half of each value. For uint16_t this is a simple bitwise AND. For signed integers you can shift the values left by 16 so that the sign bit ends up in the most significant position. You can then arithmetically shift the values back before converting them to float or, since you multiply them anyway, just scale the multiplication factor accordingly. Alternatively, you could load from two bytes earlier and arithmetically shift right by 16. Either way, your bottleneck will likely be the load ports: vpgatherdd requires 8 load uops, and together with the load for the indices that makes 9 loads distributed over two ports, which should result in about 4.5 cycles per 8 elements.
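To make the bit manipulation concrete, here is a scalar sketch of what a single gather lane does under this scheme. The helper names load_i16_as_float and load_u16_as_float are made up for illustration. The sketch assumes a little-endian target (as on x86) and that the two bytes following the requested element are readable.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: scalar model of one gather lane for signed 16-bit data.
// Assumes little-endian byte order and that src[index + 1] is readable.
inline float load_i16_as_float(const int16_t* src, int index, float mul)
{
    uint32_t raw;
    // 32-bit load: the wanted value sits in the low half,
    // the neighbouring element (garbage for our purposes) in the high half.
    std::memcpy(&raw, src + index, sizeof raw);
    // Shift the wanted 16 bits to the top. The neighbour is shifted out
    // and the sign bit of the 16-bit value lands in bit 31.
    int32_t shifted = static_cast<int32_t>(raw << 16);
    // shifted == value * 0x10000, so fold 1/0x10000 into the multiplier
    // instead of shifting back.
    return static_cast<float>(shifted) * (mul * (1.0f / 0x10000));
}

// Hypothetical helper: unsigned variant, where masking the upper half is enough.
inline float load_u16_as_float(const uint16_t* src, int index, float mul)
{
    uint32_t raw;
    std::memcpy(&raw, src + index, sizeof raw);
    return static_cast<float>(raw & 0xFFFFu) * mul;
}
```

Because the extra factor is a power of two, folding it into the multiplier should not change the rounding of the result compared with the plain float(src[i]) * mul, as long as nothing underflows.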

Untested possible AVX2 implementation. It does not handle the last elements; if cnt is not a multiple of 8, just run your original loop over the remainder at the end:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void Function(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    // fold the 1/0x10000 correction for the left shift below into the multiplier
    __m256 mul = _mm256_set1_ps(mul_ * (1.0f / 0x10000));
    for (size_t i=0; i+8<=cnt; i+=8){ // todo handle last elements
        // load indices:
        __m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(indices + i));
        // load 16bit integers in the lower halves + garbage in the upper halves:
        __m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const*>(src), idx, 2);
        // shift each value to upper half (removes garbage, makes sure sign is at the right place)
        // values are too large by a factor of 0x10000
        values = _mm256_slli_epi32(values, 16);
        // convert to float, scale and multiply:
        __m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
        // store result at the current output position
        _mm256_storeu_ps(dst + i, fvalues);
    }
}

Porting this to AVX-512 should be straightforward.
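As a rough illustration of such a port, here is an untested sketch that processes 16 elements per iteration and finishes the remainder with the scalar loop from the question. The names FunctionAvx512, GatherScaleAvx512 and HAVE_AVX512_KERNEL are made up for this example. It assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports) and, like the AVX2 version, that the two bytes after every gathered element are readable.

```cpp
#include <cstddef>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX512_KERNEL 1

// Hypothetical kernel: handles full blocks of 16 and returns how many elements it processed.
__attribute__((target("avx512f")))
static std::size_t GatherScaleAvx512(const int16_t* src, const int* indices,
                                     float* dst, std::size_t cnt, float mul_)
{
    // fold the 1/0x10000 correction for the left shift into the multiplier
    const __m512 mul = _mm512_set1_ps(mul_ * (1.0f / 0x10000));
    std::size_t i = 0;
    for (; i + 16 <= cnt; i += 16) {
        // load 16 indices
        __m512i idx = _mm512_loadu_si512(indices + i);
        // gather 32 bits per lane; note the argument order:
        // index vector first, base pointer second (the reverse of the AVX2 intrinsic)
        __m512i values = _mm512_i32gather_epi32(idx, src, 2);
        // move the wanted 16 bits to the upper half, dropping the garbage
        values = _mm512_slli_epi32(values, 16);
        // convert to float, scale and multiply
        __m512 fvalues = _mm512_mul_ps(_mm512_cvtepi32_ps(values), mul);
        _mm512_storeu_ps(dst + i, fvalues);
    }
    return i;
}
#endif

void FunctionAvx512(const int16_t* src, const int* indices, float* dst,
                    std::size_t cnt, float mul)
{
    std::size_t i = 0;
#ifdef HAVE_AVX512_KERNEL
    // only take the vector path when the CPU reports AVX-512F at run time
    if (__builtin_cpu_supports("avx512f"))
        i = GatherScaleAvx512(src, indices, dst, cnt, mul);
#endif
    // scalar loop for the remainder (or for everything without AVX-512F)
    for (; i < cnt; ++i)
        dst[i] = float(src[indices[i]]) * mul;
}
```

The run-time check is one possible design choice so the same binary also runs on CPUs without AVX-512; if you compile the whole project with -mavx512f you can drop the attribute and the check. Whether 16-wide gathers are actually faster than the AVX2 version would have to be measured, since the loads remain the likely bottleneck.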

Source: https://stackoverflow.com/questions/59339611/gather-avx2512-intrinsic-for-16-bit-integers

Tags

optimization

avx2

avx512