Question
I am using _mm_extract_epi8(__m128i a, const int imm8), which has a const int parameter. When I compile this C++ code, I get the following error message:
Error C2057: expected constant expression
__m128i a;
for (int i = 0; i < 16; i++)
{
    _mm_extract_epi8(a, i); // compilation error
}
How can I use this function in a loop?
Answer 1:
First of all, you wouldn't want to use it in a loop even if it were possible, and you wouldn't want to fully unroll a loop with 16x pextrb. That instruction costs 2 uops on Intel and AMD CPUs, and will bottleneck on the shuffle port (and port 0 for vec->int data transfer).
The _mm_extract_epi8 intrinsic requires a compile-time constant index because the pextrb r32/m8, xmm, imm8 instruction is only available with the index as an immediate (embedded into the machine code of the instruction).
If you want to give up on SIMD and write a scalar loop over the vector elements, with this many elements you should store/reload, so write it that way in C++:
alignas(16) int8_t bytes[16];             // or uint8_t
_mm_store_si128((__m128i*)bytes, vec);    // one vector store of all 16 elements
for (int i = 0; i < 16; i++) {
    foo(bytes[i]);                        // cheap scalar reloads
}
The cost of one store (and the store-forwarding latency) is amortized over 16 reloads which only cost 1 movsx eax, byte ptr [rsp+16] or whatever each (1 uop on Intel and Ryzen). Or use uint8_t for movzx zero-extension to 32-bit in the reloads. Modern CPUs can run 2 load uops per clock, and vector-store -> scalar reload store forwarding is efficient (~6 or 7 cycle latency).
With 64-bit elements, movq + pextrq is almost certainly your best bet. Store + reloads are comparable cost for the front-end and worse latency than extract.
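As a minimal sketch of that movq + pextrq approach (not from the original answer; it assumes SSE4.1 for _mm_extract_epi64 and a 64-bit target), both elements can be pulled out with ALU instructions only:

#include <immintrin.h>
#include <cstdint>

// Sketch: extract both 64-bit elements without going through memory.
static inline void extract_both_epi64(__m128i v, int64_t &lo, int64_t &hi)
{
    lo = _mm_cvtsi128_si64(v);       // movq: element 0
    hi = _mm_extract_epi64(v, 1);    // pextrq: element 1 (index must be a compile-time constant)
}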
With 32-bit elements, it's closer to break even depending on your loop. An unrolled ALU extract could be good if the loop body is small. Or you might store/reload but do the first element with _mm_cvtsi128_si32 (movd) for low latency on the first element, so the CPU can be working on that while the store-forwarding latency for the high elements happens.
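A hedged sketch of that hybrid for 32-bit elements (process() is a hypothetical per-element consumer, not anything from the original answer):

#include <immintrin.h>
#include <cstdint>

void process(int32_t x);                       // hypothetical scalar consumer

void for_each_epi32(__m128i vec)
{
    alignas(16) int32_t tmp[4];
    _mm_store_si128((__m128i*)tmp, vec);       // one vector store
    process(_mm_cvtsi128_si32(vec));           // element 0 via movd, low latency
    for (int i = 1; i < 4; i++)
        process(tmp[i]);                       // elements 1..3 via store/reload
}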
With 16-bit or 8-bit elements, it's almost certainly better to store/reload if you need to loop over all 8 or 16 elements.
If your loop makes a non-inline function call for each element, the Windows x64 calling convention has some call-preserved XMM registers, but x86-64 System V doesn't. So if your XMM reg would need to be spilled/reloaded around a function call, it's much better to just do scalar loads since the compiler will have it in memory anyway. (Hopefully it can optimize away the 2nd copy of it, or you could declare a union.)
See print a __m128i variable for working store + scalar loops for all element sizes.
If you actually want a horizontal sum, or min or max, you can do it with shuffles in O(log n) steps rather than n scalar loop iterations; see Fastest way to do horizontal float vector sum on x86 (which also covers 32-bit integer).
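For example, a horizontal sum of four 32-bit integers in the spirit of that linked answer looks like this (a sketch, SSE2 only):

#include <immintrin.h>

// Sketch: O(log n) horizontal sum of the four 32-bit elements of v.
static inline int hsum_epi32(__m128i v)
{
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));        // swap the 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));  // swap the low two dwords
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);                                      // movd the final scalar
}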
And for summing byte elements, SSE2 has a special case for that: _mm_sad_epu8(vec, _mm_setzero_si128()). See Sum reduction of unsigned bytes without overflow, using SSE2 on Intel. You can also use that to do signed bytes by range-shifting to unsigned and then subtracting 16*0x80 from the sum. https://github.com/pcordes/vectorclass/commit/630ca802bb1abefd096907f8457d090c28c8327b
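A minimal sketch of the unsigned-byte case (psadbw against zero leaves one partial sum per 64-bit half, which are then combined):

#include <immintrin.h>

// Sketch: sum of all 16 unsigned bytes of v, SSE2 only.
static inline unsigned hsum_epu8(__m128i v)
{
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128()); // two 64-bit lanes, each holding a partial sum
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);         // bring the high-lane sum down
    return (unsigned)_mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}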
Answer 2:
Intrinsic _mm_extract_epi8() cannot be used with variable indices, as already pointed out in the comments. You can use the solution below instead, but I would use it only in a non-performance-critical loop, such as printing results to a file or to the screen.
Actually, in practice it is almost never necessary to loop over the byte elements of an xmm register. For example, the following operations on epi8 elements do not need a loop over the elements (the examples may contain some self-promotion):
- Horizontal minimum, maximum, sum, sum of absolute values, root mean square, average, bitand, bitor.
- Prefix sum.
- Computing the most frequently occurring element (the mode).
- Variable bit shift.
- Creating a mask based on byte values (see the sketch after this list).
- Computing the indices of the nonzero elements.
- Etc.
In these cases efficient vectorized solutions are possible.
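As an illustration of the mask case (my own sketch, not from any of the linked answers): a bitmask of which bytes are nonzero can be built with no per-element loop, SSE2 only:

#include <immintrin.h>

// Sketch: bit i of the result is set where byte i of v is nonzero.
static inline int nonzero_byte_mask(__m128i v)
{
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128()); // 0xFF where byte == 0
    return ~_mm_movemask_epi8(is_zero) & 0xFFFF;              // invert: set bits for nonzero bytes
}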
If you cannot avoid a loop over the elements in a performance-critical loop, Peter Cordes' solution should be faster than the one below, at least if you have to extract many (2 or more) elements.
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>
/* gcc -m64 -O3 -march=nehalem extr_byte.c */
uint8_t mm_extract_epi8_var_indx(__m128i vec, int i)
{
    __m128i indx = _mm_cvtsi32_si128(i);          // put the index in the low byte of a vector
    __m128i val  = _mm_shuffle_epi8(vec, indx);   // pshufb: move byte i of vec to position 0
    return (uint8_t)_mm_cvtsi128_si32(val);
}

int main()
{
    int i;
    __m128i x = _mm_set_epi8(36,35,34,33, 32,31,30,29, 28,27,26,25, 24,23,22,21);
    uint8_t t;
    for (i = 0; i < 16; i++){
        printf("x_%i = ", i);
        t = mm_extract_epi8_var_indx(x, i);
        printf("%i \n", t);
    }
    return 0;
}
Result:
$ ./a.out
x_0 = 21
x_1 = 22
x_2 = 23
x_3 = 24
x_4 = 25
x_5 = 26
x_6 = 27
x_7 = 28
x_8 = 29
x_9 = 30
x_10 = 31
x_11 = 32
x_12 = 33
x_13 = 34
x_14 = 35
x_15 = 36
Source: https://stackoverflow.com/questions/54492956/how-to-use-mm-extract-epi8-function