What is an efficient way to load an x64 ymm register with 4 separated doubles?

瘦欲@ submitted on 2020-01-03 02:54:59

Question


What is the most efficient way to load an x64 ymm register with

  1. 4 evenly spaced doubles, i.e. taken from a contiguous set of doubles

    0  1  2  3  4  5  6  7  8  9 10 .. 100
    and I want to load, for example, 0, 10, 20, 30

  2. 4 doubles at arbitrary positions

    i.e. I want to load, for example, 1, 6, 22, 43
    

Answer 1:


The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later.

VGATHERQPD ymm1, [rsi+ymm7*8], ymm2

Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

This does the job with a single instruction. Here ymm2 is the mask register: the highest bit of each element indicates whether the corresponding value is loaded into ymm1 or left unchanged. ymm7 holds the element indices; each index is multiplied by the scale factor (8, the size of a double) and added to the base address in rsi. Note that the instruction clears the mask register when it completes, so the mask has to be set up again before every gather.

Applied to your first example, it could look like this in MASM syntax:

4 evenly spaced doubles, i.e. taken from a contiguous set of doubles

0 1 2 3 4 5 6 7 8 9 10 .. 100 --- and I want to load, for example, 0, 10, 20, 30

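.data
  align 16
  qqIndices dq 0,10,20,30
  dpValues  REAL8 0,1,2,3, ... 100
.code
  lea rsi, dpValues                       ; rsi = base address of the array
  vmovdqu ymm7, ymmword ptr qqIndices     ; load the four qword indices (unaligned load)
  vpcmpeqd ymm1, ymm1, ymm1               ; set the mask to all ones
  vgatherqpd ymm0, [rsi+ymm7*8], ymm1     ; gather the four doubles; ymm1 is cleared afterwards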

Now ymm0 contains the four doubles 0, 10, 20, 30, though I haven't tested this yet. Also note that this is not necessarily the fastest choice in every scenario. The values are all gathered separately, which means each value needs its own memory access; see "How are the gather instructions in AVX2 implemented".
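For readers working in C rather than MASM, the same gather can be sketched with the AVX2 intrinsic `_mm256_i64gather_pd`. This is only an illustration under a few assumptions: the function name `gather4_vgatherqpd` is made up for this example, the `target` attribute is a GCC/Clang extension used so the file compiles without `-mavx2`, and the caller is expected to check that the CPU really supports AVX2 before calling it.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper (not from the original answer):
   dst[i] = base[idx[i]] for i = 0..3, done with a single gather.
   The target attribute is a GCC/Clang extension that enables AVX2
   for this one function only. */
__attribute__((target("avx2")))
void gather4_vgatherqpd(const double *base, const long long idx[4], double dst[4])
{
    assert(base != NULL && idx != NULL && dst != NULL);

    /* Load the four 64-bit indices (unaligned load, no alignment needed). */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);

    /* Scale 8 = sizeof(double); this form gathers all four elements. */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);

    _mm256_storeu_pd(dst, v);
}
```

Both cases from the question use the same call: pass `{0, 10, 20, 30}` as `idx` for the evenly spaced case, or `{1, 6, 22, 43}` for arbitrary positions.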

However, according to Mysticial's comment:

I recently had to do something that required a true gather-load. (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.

So on Haswell, the fastest way is that manual sequence of scalar loads.
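The manual sequence described in the comment (two movsd, two movhpd, one vinsertf128) might look roughly like this with C intrinsics. This is a sketch, not code from the comment: the function name `gather4_manual` is made up, the `target` attribute is a GCC/Clang extension, and the compiler is free to pick slightly different instructions than the ones named in the comments.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper (not from the original comment):
   dst[i] = base[idx[i]] for i = 0..3, built from scalar loads.
   Only AVX is needed here, not AVX2. */
__attribute__((target("avx")))
void gather4_manual(const double *base, const long long idx[4], double dst[4])
{
    assert(base != NULL && idx != NULL && dst != NULL);

    /* Low 128 bits: elements 0 and 1. */
    __m128d lo = _mm_load_sd(base + idx[0]);      /* movsd:  low half  */
    lo = _mm_loadh_pd(lo, base + idx[1]);         /* movhpd: high half */

    /* High 128 bits: elements 2 and 3. */
    __m128d hi = _mm_load_sd(base + idx[2]);
    hi = _mm_loadh_pd(hi, base + idx[3]);

    /* vinsertf128: combine the two halves into one 256-bit value. */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);

    _mm256_storeu_pd(dst, v);
}
```

Because this variant only needs AVX, it also runs on CPUs that lack the gather instructions. Whether it beats `vgatherqpd` on a given microarchitecture is something to measure rather than assume.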

So "efficient" in terms of instruction count would mean VGATHER, while "efficient" in terms of execution time would mean the manual sequence from the comment (so far; let's see how future architectures perform).

EDIT: according to the comments, the VGATHER instructions get faster on Broadwell and Skylake.




Answer 2:


I think you should look at a gather operation such as VGATHERQPD.

The instruction conditionally loads up to 2 or 4 double-precision floating-point values from memory addresses specified by the memory operand (the second operand) and using qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base and a constant scale factor.

Note that this requires AVX2, so it is not applicable to Sandy Bridge/Ivy Bridge, which have AVX but not AVX2.
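Since the gather instruction is unavailable on AVX-only CPUs, code that has to run on both kinds of machine needs a runtime check. A minimal sketch, assuming GCC or Clang (`__builtin_cpu_supports` and the `target` attribute are compiler extensions) and using made-up function names:

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* AVX2 path: one VGATHERQPD. All names in this block are hypothetical. */
__attribute__((target("avx2")))
static void gather4_avx2_path(const double *base, const long long idx[4], double dst[4])
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    _mm256_storeu_pd(dst, _mm256_i64gather_pd(base, vindex, 8));
}

/* Plain C fallback for CPUs without AVX2 (e.g. Sandy Bridge, Ivy Bridge). */
static void gather4_scalar_path(const double *base, const long long idx[4], double dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = base[idx[i]];
}

/* Picks the AVX2 path only when the running CPU reports AVX2 support. */
void gather4_dispatch(const double *base, const long long idx[4], double dst[4])
{
    assert(base != NULL && idx != NULL && dst != NULL);

    if (__builtin_cpu_supports("avx2"))
        gather4_avx2_path(base, idx, dst);
    else
        gather4_scalar_path(base, idx, dst);
}
```

In a hot loop the check would normally be done once up front (for example by storing a function pointer) rather than on every call.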



Source: https://stackoverflow.com/questions/35356533/what-efficient-way-to-load-x64-ymm-register-with-4-seperated-doubles
