AVX 4-bit integers


Question


I need to perform the following operation:

 w[i] = scale * v[i] + point

scale and point are fixed, whereas v[] is a vector of 4-bit integers.

I need to compute w[] for an arbitrary input vector v[], and I want to speed up the process using AVX intrinsics. However, each v[i] is only a 4-bit integer.

The question is: how do I perform operations on 4-bit integers using intrinsics? I could use 8-bit integers and perform the operations that way, but is there a way to do the following:

[a,b] + [c,d] = [a+c, b+d]

[a,b] * [c,d] = [a*c, b*d]

(Ignoring overflow)

Using AVX intrinsics, where [...,...] is an 8-bit integer and a, b, c, d are 4-bit integers?

If yes, would it be possible to give a short example on how this could work?


Answer 1:


Just a partial answer (addition only), in pseudocode that should be easy to extend to AVX2 intrinsics:

uint8_t a, b;          // input containing two nibbles each

uint8_t c = a + b;     // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10;             // only bit 4 is of interest
c -= x;                // undo carry of lower to upper nibble

If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), that operand can be left out of the computation of x.
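
The same trick maps directly to AVX2. A minimal sketch, assuming the two input registers each hold packed nibble pairs (the helper name is illustrative, not from any library):

#include <immintrin.h>

static inline __m256i add_nibbles_avx2(__m256i a, __m256i b)
{
    const __m256i bit4 = _mm256_set1_epi8(0x10);
    __m256i c = _mm256_add_epi8(a, b);                        // add with (unwanted) carry between nibbles
    __m256i x = _mm256_xor_si256(_mm256_xor_si256(a, b), c);  // bits which are the result of a carry
    x = _mm256_and_si256(x, bit4);                            // only bit 4 is of interest
    return _mm256_sub_epi8(c, x);                             // undo carry of lower to upper nibble
}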

As for multiplication: if scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessary). Otherwise, I'm afraid you need to mask out 4 bits of each 16-bit word, do the operation, and fiddle them together at the end. Pseudocode (there is no AVX 8-bit multiplication, so we need to operate on 16-bit words):

uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble

uint16_t a, b; // input containing 4 nibbles each.

uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;

uint16_t result = p0 | p1 | p2 | p3;  // join results together 
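
The same per-nibble masking scheme can be expressed with AVX2 intrinsics. A hedged sketch, assuming a and b are __m256i registers each holding sixteen 16-bit words of packed nibbles (mul_nibbles_avx2 is an illustrative name):

#include <immintrin.h>

static inline __m256i mul_nibbles_avx2(__m256i a, __m256i b)
{
    const __m256i m0 = _mm256_set1_epi16(0x000F);
    const __m256i m1 = _mm256_set1_epi16(0x00F0);
    const __m256i m2 = _mm256_set1_epi16(0x0F00);
    const __m256i m3 = _mm256_set1_epi16((short)0xF000);

    // one 16-bit multiply per nibble position, masked back to its own nibble
    __m256i p0 = _mm256_and_si256(_mm256_mullo_epi16(a, b), m0);
    __m256i p1 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 4),
                                                     _mm256_and_si256(b, m1)), m1);
    __m256i p2 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 8),
                                                     _mm256_and_si256(b, m2)), m2);
    __m256i p3 = _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(a, 12),
                                                     _mm256_and_si256(b, m3)), m3);

    // join the four nibble products together
    return _mm256_or_si256(_mm256_or_si256(p0, p1), _mm256_or_si256(p2, p3));
}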



Answer 2:


For fixed a, b in w[i] = v[i] * a + b, you can simply use a lookup table, e.g. w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the least significant nibble. Split the input into even and odd nibbles, with the odd LUT preshifted by 4.

auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];

How to generate those LUTs dynamically (if needed at all) is another issue.
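
A minimal sketch of that idea with AVX2 intrinsics, under a few assumptions: the 16 results of scale * x + point (truncated to 4 bits) are precomputed, LUTA holds them in the low nibble and LUTB preshifted by 4, and each 16-entry table is replicated in both 128-bit lanes (since _mm256_shuffle_epi8 looks up within lanes). The helper name and parameters are illustrative:

#include <immintrin.h>

static inline __m256i scale_nibbles_lut(__m256i input, __m256i LUTA, __m256i LUTB)
{
    const __m256i low4 = _mm256_set1_epi8(0x0F);
    __m256i a  = _mm256_and_si256(input, low4);                        // even (low) nibbles as indices
    __m256i b  = _mm256_and_si256(_mm256_srli_epi16(input, 4), low4);  // odd (high) nibbles as indices
    __m256i wa = _mm256_shuffle_epi8(LUTA, a);    // (scale * a + point) & 0xF in the low nibble
    __m256i wb = _mm256_shuffle_epi8(LUTB, b);    // ((scale * b + point) & 0xF) << 4 in the high nibble
    return _mm256_or_si256(wa, wb);               // repack two 4-bit results per byte
}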




Answer 3:


4-bit additions/multiplications can be done using AVX2, particularly if you want to apply those computations to larger vectors (say, more than 128 elements). However, if you want to add just 4 numbers, use straight scalar code.

We have done extensive work on how to deal with 4-bit integers, and we have recently developed a library for it, Clover: 4-bit Quantized Linear Algebra Library (with a focus on quantization). The code is also available on GitHub.

As you mentioned only 4-bit integers, I will assume that you are referring to signed integers (i.e. two's complement) and base my answer accordingly. Note that handling unsigned values is in fact much simpler.

I will also assume that you would like to take a vector int8_t v[n/2] that contains n 4-bit integers and produce int8_t v_sum[n/4] holding n/2 4-bit integers. All the code related to the description below is available as a gist.

Packing / Unpacking

Obviously, AVX2 does not offer any instructions to perform additions / multiplications on 4-bit integers; therefore, you must resort to the available 8- or 16-bit instructions. The first step in dealing with 4-bit arithmetic is to devise methods for placing the 4-bit nibbles into larger 8-, 16-, or 32-bit chunks.

For the sake of clarity, let's assume that you want to unpack a given nibble, from a 32-bit chunk that stores multiple 4-bit signed values, into a corresponding 32-bit integer. This can be done with two bit shifts:

  1. a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
  2. an arithmetic right shift is used to shift the nibble to the lowest order 4-bits of the 32-bit entity.

The arithmetic right shift performs sign extension, filling the high-order 28 bits with the sign bit of the nibble and yielding a 32-bit integer with the same value as the two's complement 4-bit value.

The goal of packing is to revert the unpacking operation. Two bit shifts can be used to place the lowest-order 4 bits of a 32-bit integer anywhere within a 32-bit entity.

  1. a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
  2. a logical right shift is used to shift the nibble to somewhere within the 32-bit entity.

The first sets the bits of lower order than the nibble to zero, and the second sets the bits of higher order than the nibble to zero. A bitwise OR operation can then be used to store up to eight nibbles in the 32-bit entity, as sketched below.
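
For illustration, here is a scalar C sketch of both directions (my own naming; it assumes the compiler implements right shifts of signed integers as arithmetic shifts, which mainstream compilers do):

#include <stdint.h>

// extract nibble i (0..7) of a 32-bit word as a sign-extended 32-bit integer
static inline int32_t unpack_nibble(uint32_t word, int i)
{
    return (int32_t)(word << (28 - 4 * i)) >> 28;   // left shift to the top, arithmetic shift back down
}

// place the low 4 bits of value at nibble position i (0..7) of a 32-bit word
static inline uint32_t pack_nibble(int32_t value, int i)
{
    return ((uint32_t)value << 28) >> (28 - 4 * i); // left shift to the top, logical shift into place
}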

How to apply this in practice?

Let's assume that you have 64 x 32-bit integer values stored in 8 AVX registers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8. Let's also assume that each value is in the [-8, 7] range. If you want to pack them into a single AVX register of 64 x 4-bit values, you can do the following:

//
// Transpose the 8x8 registers (helper routine, not an intrinsic; see the gist)
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);
//
// Shift values left
//
q_1 = _mm256_slli_epi32(q_1, 28);
q_2 = _mm256_slli_epi32(q_2, 28);
q_3 = _mm256_slli_epi32(q_3, 28);
q_4 = _mm256_slli_epi32(q_4, 28);
q_5 = _mm256_slli_epi32(q_5, 28);
q_6 = _mm256_slli_epi32(q_6, 28);
q_7 = _mm256_slli_epi32(q_7, 28);
q_8 = _mm256_slli_epi32(q_8, 28);
//
// Shift values right (zero-extend)
//
q_1 = _mm256_srli_epi32(q_1, 7 * 4);
q_2 = _mm256_srli_epi32(q_2, 6 * 4);
q_3 = _mm256_srli_epi32(q_3, 5 * 4);
q_4 = _mm256_srli_epi32(q_4, 4 * 4);
q_5 = _mm256_srli_epi32(q_5, 3 * 4);
q_6 = _mm256_srli_epi32(q_6, 2 * 4);
q_7 = _mm256_srli_epi32(q_7, 1 * 4);
q_8 = _mm256_srli_epi32(q_8, 0 * 4);
//
// Pack together
//
__m256i t1 = _mm256_or_si256(q_1, q_2);
__m256i t2 = _mm256_or_si256(q_3, q_4);
__m256i t3 = _mm256_or_si256(q_5, q_6);
__m256i t4 = _mm256_or_si256(q_7, q_8);
__m256i t5 = _mm256_or_si256(t1, t2);
__m256i t6 = _mm256_or_si256(t3, t4);
__m256i t7 = _mm256_or_si256(t5, t6);

Shifts usually have a throughput of 1 cycle and a latency of 1 cycle, so you can assume that they are in fact quite inexpensive. If you have to deal with unsigned 4-bit values, the left shifts can be skipped altogether.

To reverse the procedure, you can apply the same method. Let's assume that you have loaded 64 4-bit values into a single AVX register __m256i qu_64. In order to produce 64 x 32-bit integers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8, you can execute the following:

//
// Shift values left
//
const __m256i qu_1 = _mm256_slli_epi32(qu_64, 4 * 7);
const __m256i qu_2 = _mm256_slli_epi32(qu_64, 4 * 6);
const __m256i qu_3 = _mm256_slli_epi32(qu_64, 4 * 5);
const __m256i qu_4 = _mm256_slli_epi32(qu_64, 4 * 4);
const __m256i qu_5 = _mm256_slli_epi32(qu_64, 4 * 3);
const __m256i qu_6 = _mm256_slli_epi32(qu_64, 4 * 2);
const __m256i qu_7 = _mm256_slli_epi32(qu_64, 4 * 1);
const __m256i qu_8 = _mm256_slli_epi32(qu_64, 4 * 0);
//
// Shift values right (sign-extend) and obtain 8x8
// 32-bit values
//
__m256i q_1 = _mm256_srai_epi32(qu_1, 28);
__m256i q_2 = _mm256_srai_epi32(qu_2, 28);
__m256i q_3 = _mm256_srai_epi32(qu_3, 28);
__m256i q_4 = _mm256_srai_epi32(qu_4, 28);
__m256i q_5 = _mm256_srai_epi32(qu_5, 28);
__m256i q_6 = _mm256_srai_epi32(qu_6, 28);
__m256i q_7 = _mm256_srai_epi32(qu_7, 28);
__m256i q_8 = _mm256_srai_epi32(qu_8, 28);
//
// Transpose the 8x8 values
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);            

If dealing with unsigned 4-bit values, no sign extension is needed, so the arithmetic right shifts (_mm256_srai_epi32) can simply be replaced with logical right shifts (_mm256_srli_epi32).

For more details, have a look at the gist.

Adding Odd and Even 4-bit entries

Let's assume that you load from the vector using AVX:

const __m256i qv = _mm256_loadu_si256( ... );

Now, we can easily extract the odd and the even parts. Life would have been much easier if there were 8-bit shifts in AVX2, but there are none, so we have to deal with 16-bit shifts:

const __m256i hi_mask_08   = _mm256_set1_epi8(-16);
const __m256i qv_odd_dirty = _mm256_slli_epi16(qv, 4);
const __m256i qv_odd_shift = _mm256_and_si256(hi_mask_08, qv_odd_dirty);
const __m256i qv_evn_shift = _mm256_and_si256(hi_mask_08, qv);

At this point in time, you have essentially separated the odd and the even nibbles, in two AVX registers that hold their values in the high 4-bits (i.e. values in the range [-8 * 2^4, 7 * 2^4]). The procedure is the same even when dealing with unsigned 4-bit values. Now it is time to add the values.

const __m256i qv_sum_shift = _mm256_add_epi8(qv_odd_shift, qv_evn_shift);

This will work for both signed and unsigned values, since binary addition is the same in two's complement. However, if you want to avoid overflows or underflows, you can also consider the saturating additions already supported in AVX2 (for both signed and unsigned):

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)

qv_sum_shift will be in the range [-8 * 2^4, 7 * 2^4]. To set it to the right value, we need to shift it back (Note that if qv_sum has to be unsigned, we can use _mm256_srli_epi16 instead):

const __m256i qv_sum = _mm256_srai_epi16(qv_sum_shift, 4);

The summation is now complete. Depending on your use case, this could well be the end of the program, assuming that you want to produce 8-bit chunks of memory as a result. But let's assume that you want to solve a harder problem: the output is again a vector of 4-bit elements, with the same memory layout as the input. In that case, we need to pack the 8-bit chunks into 4-bit chunks. However, instead of having 64 values, we will end up with only 32 values (i.e. half the size of the input vector).

From this point there are two options: we either look ahead in the vector, processing 128 x 4-bit values so that we produce 64 x 4-bit values, or we revert to SSE, dealing with 32 x 4-bit values. Either way, the fastest way to pack the 8-bit chunks into 4-bit chunks is to use the vpackuswb (or packuswb for SSE) instruction:

__m256i _mm256_packus_epi16 (__m256i a, __m256i b)

This instruction converts packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and stores the results in dst. This means that we have to interleave the odd and even 4-bit values such that they reside in the 8 low bits of a 16-bit memory chunk. We can proceed as follows:

const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);

const __m256i qv_sum_lo       = _mm256_and_si256(lo_mask_16, qv_sum);
const __m256i qv_sum_hi_dirty = _mm256_srli_epi16(qv_sum_shift, 8);
const __m256i qv_sum_hi       = _mm256_and_si256(hi_mask_16, qv_sum_hi_dirty);

const __m256i qv_sum_16       = _mm256_or_si256(qv_sum_lo, qv_sum_hi);

The procedure will be identical for both signed and unsigned 4-bit values. Now, qv_sum_16 contains two consecutive 4-bit values, stored in the low-bits of a 16-bit memory chunk. Assuming that we have obtained qv_sum_16 from the next iteration (call it qv_sum_16_next), we can pack everything with:

const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16, qv_sum_16_next);
const __m256i result      = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);

Alternatively, if we want to produce only 32 x 4-bit values, we can do the following:

const __m128i lo = _mm256_extractf128_si256(qv_sum_16, 0);
const __m128i hi = _mm256_extractf128_si256(qv_sum_16, 1);
const __m128i result = _mm_packus_epi16(lo, hi);

Putting it all together

Assuming signed nibbles and a vector size n that is a multiple of 128 elements, we can perform the odd-even addition, producing n/2 elements, as follows:

void add_odd_even(uint64_t n, int8_t * v, int8_t * r)
{
    //
    // Make sure that the vector size is a multiple of 128
    //
    assert(n % 128 == 0);
    const uint64_t blocks = n / 64;
    //
    // Define constants that will be used for masking operations
    //
    const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
    const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
    const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);

    for (uint64_t b = 0; b < blocks; b += 2) {
        //
        // Calculate the offsets
        //
        const uint64_t offset0 = b * 32;
        const uint64_t offset1 = b * 32 + 32;
        const uint64_t offset2 = b * 32 / 2;
        //
        // Load 128 values in two AVX registers. Each register will
        // contain 64 x 4-bit values in the range [-8, 7].
        //
        const __m256i qv_1 = _mm256_loadu_si256((__m256i *) (v + offset0));
        const __m256i qv_2 = _mm256_loadu_si256((__m256i *) (v + offset1));
        //
        // Extract the odd and the even parts. The values will be split in
        // two registers qv_odd_shift and qv_evn_shift, each of them having
        // 32 x 8-bit values, such that each value is multiplied by 2^4
        // and resides in the range [-8 * 2^4, 7 * 2^4]
        //
        const __m256i qv_odd_dirty_1 = _mm256_slli_epi16(qv_1, 4);
        const __m256i qv_odd_shift_1 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_1);
        const __m256i qv_evn_shift_1 = _mm256_and_si256(hi_mask_08, qv_1);
        const __m256i qv_odd_dirty_2 = _mm256_slli_epi16(qv_2, 4);
        const __m256i qv_odd_shift_2 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_2);
        const __m256i qv_evn_shift_2 = _mm256_and_si256(hi_mask_08, qv_2);
        //
        // Perform addition. In case of overflows / underflows, behaviour
        // is undefined. Values are still in the range [-8 * 2^4, 7 * 2^4].
        //
        const __m256i qv_sum_shift_1 = _mm256_add_epi8(qv_odd_shift_1, qv_evn_shift_1);
        const __m256i qv_sum_shift_2 = _mm256_add_epi8(qv_odd_shift_2, qv_evn_shift_2);
        //
        // Divide by 2^4. At this point in time, each of the two AVX registers holds
        // 32 x 8-bit values that are in the range of [-8, 7]. Summation is complete.
        //
        const __m256i qv_sum_1 = _mm256_srai_epi16(qv_sum_shift_1, 4);
        const __m256i qv_sum_2 = _mm256_srai_epi16(qv_sum_shift_2, 4);
        //
        // Now we want to take the sum stored in the high byte of each 16-bit
        // lane and place it next to the sum stored in the low byte. We do this
        // with logical (zero-extending) right shifts and 16-bit masks. This
        // operation yields registers in which each pair of consecutive 4-bit
        // values resides in the low 8 bits of a 16-bit chunk.
        //
        const __m256i qv_sum_1_lo       = _mm256_and_si256(lo_mask_16, qv_sum_1);
        const __m256i qv_sum_1_hi_dirty = _mm256_srli_epi16(qv_sum_shift_1, 8);
        const __m256i qv_sum_1_hi       = _mm256_and_si256(hi_mask_16, qv_sum_1_hi_dirty);
        const __m256i qv_sum_2_lo       = _mm256_and_si256(lo_mask_16, qv_sum_2);
        const __m256i qv_sum_2_hi_dirty = _mm256_srli_epi16(qv_sum_shift_2, 8);
        const __m256i qv_sum_2_hi       = _mm256_and_si256(hi_mask_16, qv_sum_2_hi_dirty);
        const __m256i qv_sum_16_1       = _mm256_or_si256(qv_sum_1_lo, qv_sum_1_hi);
        const __m256i qv_sum_16_2       = _mm256_or_si256(qv_sum_2_lo, qv_sum_2_hi);
        //
        // Pack the two registers of 32 x 4-bit values, into a single one having
        // 64 x 4-bit values. Use the unsigned version, to avoid saturation.
        //
        const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16_1, qv_sum_16_2);
        //
        // Interleave the 64-bit chunks.
        //
        const __m256i qv_sum = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
        //
        // Store the result
        //
        _mm256_storeu_si256((__m256i *)(r + offset2), qv_sum);
    }
}

A self-contained tester and validator for this code is available in the gist.

Multiplying Odd and Even 4-bit entries

For the multiplication of the odd and even entries, we can use the same strategy as described above to extract the 4-bit values into larger chunks.

AVX2 does not offer 8-bit multiplication, only 16-bit. However, we can implement 8-bit multiplication following the method used in Agner Fog's C++ vector class library:

static inline Vec32c operator * (Vec32c const & a, Vec32c const & b) {
    // There is no 8-bit multiply in AVX2. Split into two 16-bit multiplies
    __m256i aodd    = _mm256_srli_epi16(a,8);         // odd numbered elements of a
    __m256i bodd    = _mm256_srli_epi16(b,8);         // odd numbered elements of b
    __m256i muleven = _mm256_mullo_epi16(a,b);        // product of even numbered elements
    __m256i mulodd  = _mm256_mullo_epi16(aodd,bodd);  // product of odd  numbered elements
            mulodd  = _mm256_slli_epi16(mulodd,8);    // put odd numbered elements back in place
    __m256i mask    = _mm256_set1_epi32(0x00FF00FF);  // mask for even positions
    __m256i product = selectb(mask,muleven,mulodd);   // interleave even and odd
    return product;
}
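
(selectb is a helper from the vector class library; with raw intrinsics the final blend could, I believe, be written roughly as _mm256_blendv_epi8(mulodd, muleven, mask), which keeps muleven at the even byte positions selected by the mask.)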

I would suggest, however, extracting the nibbles into 16-bit chunks first and then using _mm256_mullo_epi16, to avoid performing unnecessary shifts.
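
A hedged sketch of that suggestion (my own helper, assuming qv holds 64 packed signed nibbles, two per byte, and that we want the product of each low/high nibble pair as a sign-extended 16-bit result):

#include <immintrin.h>

static inline void mul_odd_even(__m256i qv, __m256i *p_low, __m256i *p_high)
{
    // extract each of the four nibbles of every 16-bit lane, sign-extended into the lane
    const __m256i n0 = _mm256_srai_epi16(_mm256_slli_epi16(qv, 12), 12); // nibble 0 (bits  0..3)
    const __m256i n1 = _mm256_srai_epi16(_mm256_slli_epi16(qv,  8), 12); // nibble 1 (bits  4..7)
    const __m256i n2 = _mm256_srai_epi16(_mm256_slli_epi16(qv,  4), 12); // nibble 2 (bits  8..11)
    const __m256i n3 = _mm256_srai_epi16(qv, 12);                        // nibble 3 (bits 12..15)

    // 16-bit multiplies; products of two 4-bit values cannot overflow 16 bits
    *p_low  = _mm256_mullo_epi16(n0, n1); // products of the pairs stored in the low bytes
    *p_high = _mm256_mullo_epi16(n2, n3); // products of the pairs stored in the high bytes
}

Note that the products no longer fit in 4 bits, so some truncation or requantization step would be needed before repacking them into nibbles as in add_odd_even.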



Source: https://stackoverflow.com/questions/44011366/avx-4-bit-integers
