Question
What is the most efficient way to load an x64 ymm register with:
4 doubles evenly spaced, i.e. given a contiguous set of doubles 0 1 2 3 4 5 6 7 8 9 10 .. 100, I want to load, for example, 0, 10, 20, 30
4 doubles at any position, i.e. I want to load, for example, 1, 6, 22, 43
Answer 1:
The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later, which can do this with a single instruction:
VGATHERQPD ymm1, [rsi+ymm7*8], ymm2
Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.
Here ymm2 is the mask register: the highest bit of each element indicates whether the corresponding value is loaded into ymm1 or left unchanged.
ymm7 contains the element indices, each of which is multiplied by the scale factor (8 here, the size of a double) to form the byte offset from rsi.
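To make the mask and index semantics described above concrete, here is a small scalar model in C of what the instruction does for each of the four elements. It is only an illustration under stated assumptions: the name gather_qpd_model and its parameters are invented for this sketch, and it ignores faults and partial completion.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of: VGATHERQPD ymm_dst, [base + ymm_idx*8], ymm_mask
 * Hypothetical helper for illustration only, not a real API. */
void gather_qpd_model(double dst[4], const double *base,
                      const int64_t idx[4], uint64_t mask[4])
{
    assert(dst && base && idx && mask);
    for (int i = 0; i < 4; ++i) {
        if (mask[i] >> 63) {
            /* Top mask bit set: load from base + idx[i] * 8 bytes. */
            dst[i] = base[idx[i]];
        }
        /* Top mask bit clear: dst[i] keeps its previous value. */

        /* The instruction leaves the mask register zeroed on completion. */
        mask[i] = 0;
    }
}
```

Because the mask is consumed, code that gathers in a loop has to regenerate the all-ones mask before every gather.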
So applied to your examples, it could look like this in MASM syntax:
"4 doubles evenly spaced, i.e. a contiguous set of doubles 0 1 2 3 4 5 6 7 8 9 10 .. 100, and I want to load, for example, 0, 10, 20, 30"
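.data
align 16
qqIndices dq 0,10,20,30
dpValues REAL8 0,1,2,3, ... 100
.code
lea rsi, dpValues
vmovdqu ymm7, ymmword ptr qqIndices ; load the four qword indices (unaligned load, data is only 16-byte aligned)
vpcmpeqw ymm1, ymm1, ymm1 ; set mask to all ones
vgatherqpd ymm0, [rsi+ymm7*8], ymm1 ; qword indices need a ymm index register; ymm1 is zeroed afterwards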
Now ymm0 contains four doubles 0, 10, 20, 30.
I haven't tested this yet, though. Also note that this is not necessarily the fastest choice in every scenario: the values are all gathered separately, so each value needs its own memory access; see "How are the gather instructions in AVX2 implemented".
According to Mysticial's comment:
I recently had to do something that required a true gather-load (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.
So on Haswell the fastest way would be that manual sequence.
So "efficient" in terms of instruction count would be VGATHER, and "efficient" in terms of execution time would be the manual movsd/movhpd/vinsertf128 sequence (so far; let's see how future architectures perform).
EDIT: according to the comments, the VGATHER instructions get faster on Broadwell and Skylake.
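The manual sequence from Mysticial's comment (2x movsd + 2x movhpd + vinsertf128) can be sketched with C intrinsics as follows. This is an untested-on-Haswell illustration, not a benchmark: the function names are made up, and it assumes GCC or Clang on x86-64 (for the target attribute and __builtin_cpu_supports). A scalar fallback is included so the helper also runs on CPUs without AVX.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: builds one 256-bit vector from four scalar loads. */
__attribute__((target("avx")))
static void gather4_manual_avx(const double *base, const long long idx[4],
                               double out[4])
{
    __m128d lo = _mm_load_sd(base + idx[0]);   /* movsd:  element 0 -> low half  */
    lo = _mm_loadh_pd(lo, base + idx[1]);      /* movhpd: element 1 -> high half */
    __m128d hi = _mm_load_sd(base + idx[2]);   /* movsd:  element 2 */
    hi = _mm_loadh_pd(hi, base + idx[3]);      /* movhpd: element 3 */
    /* vinsertf128: place hi in the upper 128 bits of the ymm value. */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: uses the AVX path only when the CPU reports AVX. */
void gather4_manual(const double *base, const long long idx[4], double out[4])
{
    assert(base && idx && out);
    if (__builtin_cpu_supports("avx")) {
        gather4_manual_avx(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];             /* scalar fallback */
    }
}
```

In a real inner loop the result would stay in a register instead of being stored to out; the store is only there to keep the sketch easy to call and check.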
Answer 2:
I think you are looking for a gather operation such as VGATHERQPD.
The instruction conditionally loads up to 2 or 4 double-precision floating-point values from memory addresses specified by the memory operand (the second operand) and using qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base and a constant scale factor.
Note that this requires AVX2, so it is not available on Sandy Bridge/Ivy Bridge, which have AVX but not AVX2.
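From C, the same instruction is reachable through the _mm256_i64gather_pd intrinsic. Since it needs AVX2, the sketch below checks for AVX2 at run time and otherwise falls back to plain scalar loads. The function names are hypothetical, and the target attribute and __builtin_cpu_supports assume GCC or Clang on x86-64.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: one vgatherqpd with an implicit all-ones mask. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const long long idx[4],
                         double out[4])
{
    /* Load the four qword indices (unaligned load). */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 8 = sizeof(double), so element i comes from base[idx[i]]. */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: picks the AVX2 path only when the CPU supports it. */
void gather4(const double *base, const long long idx[4], double out[4])
{
    assert(base && idx && out);
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = base[idx[i]];   /* e.g. Sandy Bridge / Ivy Bridge */
    }
}
```

Calling gather4 with indices 0, 10, 20, 30 on the array 0 .. 100 from the question fills out with 0, 10, 20, 30 on either path.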
Source: https://stackoverflow.com/questions/35356533/what-efficient-way-to-load-x64-ymm-register-with-4-seperated-doubles