How to use vindex and scale with _mm_i32gather_epi32 to gather elements? [duplicate]

Submitted by 孤者浪人 on 2019-12-14 03:29:51

Question


Intel's Intrinsic Guide says:

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

And:

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0

If I am parsing things correctly then vindex (with scale) are the indexes into base_addr used to create the __m128i result.

Below I am trying to create val = arr[1] << 96 | arr[5] << 64 | arr[9] << 32 | arr[13] << 0. That is, starting at 1 take every 4th element.

$ cat -n gather.cxx
 1  #include <immintrin.h>
 2  typedef unsigned int u32;
 3  int main(int argc, char* argv[])
 4  {
 5          u32 arr[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
 6          __m128i idx = _mm_set_epi32(1,5,9,13);
 7          __m128i val = _mm_i32gather_epi32(arr, idx, 1);
 8          return 0;
 9   }

But when I examine val:

(gdb) n
6               __m128i idx = _mm_set_epi32(1,5,9,13);
(gdb) n
7               __m128i val = _mm_i32gather_epi32(arr, idx, 1);
(gdb) n
8               return 0;
(gdb) p val
$1 = {0x300000004000000, 0x100000002000000}

It appears I am using vindex incorrectly. It appears I am selecting indices 1,2,3,4.

How do I use vindex and scale to select array indices 1,5,9,13?


Answer 1:


Your array elements are 4 bytes wide. Therefore use a scale factor of 4 in the VSIB addressing mode when using element indices instead of byte offsets.

The base_addr argument is declared as int const*, but at no point is any C pointer math done with it. It's fed directly to the asm instruction, so you need to take care of byte offsets yourself. (And hopefully also taking care of strict aliasing in case you want to grab dwords out of a uint64_t[] or char[].) It could just as well be a const void*.

If the intrinsic multiplied your scale factor by 4, you wouldn't be able to use it with byte offsets, only with int indices. The asm instruction can scale by 1, 2, 4, or 8, using the usual x86 addressing mode encoding: a 2-bit shift count.
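To make the addressing concrete, here is a scalar sketch of the Operation pseudocode quoted in the question. gather_epi32_model is a hypothetical helper written for illustration, not part of any intrinsics header. It only models the address arithmetic (base_addr plus index times scale, in bytes) and uses memcpy for the possibly unaligned loads.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the documented operation (not the real intrinsic):
// dst[j] = 32-bit load from (bytes of base) + idx[j] * scale.
std::array<std::int32_t, 4> gather_epi32_model(const void* base,
                                               const std::array<std::int32_t, 4>& idx,
                                               int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char* bytes = static_cast<const unsigned char*>(base);
    std::array<std::int32_t, 4> dst{};
    for (int j = 0; j < 4; ++j) {
        // Byte offset: the index is multiplied by scale only,
        // never implicitly by sizeof(int).
        std::ptrdiff_t off = static_cast<std::ptrdiff_t>(idx[j]) * scale;
        std::memcpy(&dst[j], bytes + off, sizeof dst[j]);  // unaligned-safe load
    }
    return dst;
}
```

Under this model, calling it on the question's 16-element array with indices {13, 9, 5, 1} and scale 4 reads whole elements arr[13], arr[9], arr[5], arr[1], while scale 1 reads four bytes starting at byte offsets 13, 9, 5 and 1.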


With scale = 1, your strided indices (stride of 4, starting at 1) are treated as byte offsets, so each load gets zeros everywhere except the high byte of the element. That is, each load starts 1 byte past the start of an array element, and x86 is little-endian.

Notice that you didn't get 1,2,3,4, you got 1<<24, 2<<24, etc. Printing as one big 64-bit integer makes that harder to spot.
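As a rough illustration of why the 64-bit view hides the pattern, the sketch below splits the two 64-bit values GDB printed in the question into four 32-bit lanes. split_lanes and top_byte are hypothetical helper names, and the code is plain integer arithmetic with no intrinsics involved.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Split the two 64-bit halves GDB printed for a __m128i into four
// 32-bit lanes, lowest lane first.
std::array<std::uint32_t, 4> split_lanes(std::uint64_t lo, std::uint64_t hi)
{
    std::array<std::uint32_t, 4> lanes = {{
        static_cast<std::uint32_t>(lo),        // lane 0
        static_cast<std::uint32_t>(lo >> 32),  // lane 1
        static_cast<std::uint32_t>(hi),        // lane 2
        static_cast<std::uint32_t>(hi >> 32)   // lane 3
    }};
    return lanes;
}

// For a lane where only the high byte is set, recover that byte.
std::uint32_t top_byte(std::uint32_t lane)
{
    assert((lane & 0x00FFFFFFu) == 0);  // low three bytes are expected to be zero
    return lane >> 24;
}
```

Feeding it the values from the question, split_lanes(0x0300000004000000, 0x0100000002000000) gives lanes 0x04000000, 0x03000000, 0x02000000, 0x01000000, i.e. 4<<24, 3<<24, 2<<24 and 1<<24 from lane 0 upward.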

With the scale argument changed from 1 to 4 in the source, your gather is an identity mapping (arr[i] == i, so each gathered value equals its index):

(gdb) p  $xmm7.v4_int32
$2 = {13, 9, 5, 1}

I'm not sure if GDB has a convenient way to print the elements of a __m128i variable without knowing what register it's in.



Source: https://stackoverflow.com/questions/50883785/how-to-use-vindex-and-scale-with-mm-i32gather-epi32-to-gather-elements
