What is an efficient way to load an x64 ymm register with 4 separated doubles?
What is the most efficient way to load an x64 ymm register with 4 doubles that are evenly spaced? That is, given a contiguous set of doubles

0 1 2 3 4 5 6 7 8 9 10 .. 100

I want to load, for example, 0, 10, 20, 30.

And what about 4 doubles at arbitrary positions, i.e. I want to load, for example, 1, 6, 22, 43?

zx485

The simplest approach is a gather instruction, VGATHERDPD or VGATHERQPD; both are AVX2 instructions available on Haswell and up. With a ymm destination, the form that takes an xmm index register (four dword indices) is VGATHERDPD; VGATHERQPD takes four qword indices in a ymm register instead.

    VGATHERDPD ymm1, [rsi+xmm7*8], ymm2

Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

which can achieve
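As a rough illustration of the gather approach, here is a minimal C sketch using the AVX2 intrinsic `_mm256_i32gather_pd`, which compilers typically lower to the gather instruction with 32-bit indices (VGATHERDPD). The helper names `gather4`, `gather4_avx2` and `gather4_scalar` are hypothetical and not part of the original answer. The sketch assumes GCC or Clang on x86-64, and it falls back to scalar loads when the CPU does not report AVX2. It makes no claim about being the fastest option on any given microarchitecture.

```c
#include <immintrin.h>

/* Hypothetical helper: load src[idx[0]], ..., src[idx[3]] into dst[0..3]
   with one AVX2 gather. The target attribute (GCC/Clang) allows AVX2
   intrinsics here without compiling the whole file with -mavx2. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *src, const int idx[4], double dst[4])
{
    /* Four 32-bit element indices in an xmm register. */
    __m128i vindex = _mm_loadu_si128((const __m128i *)idx);

    /* Scale 8 = sizeof(double): element i is read from src + idx[i]*8 bytes.
       This intrinsic gathers all four elements (implicit all-ones mask). */
    __m256d v = _mm256_i32gather_pd(src, vindex, 8);

    _mm256_storeu_pd(dst, v);
}

/* Scalar fallback with the same result, for CPUs without AVX2. */
static void gather4_scalar(const double *src, const int idx[4], double dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = src[idx[i]];
}

/* Hypothetical entry point: pick the AVX2 path only if the CPU reports it. */
void gather4(const double *src, const int idx[4], double dst[4])
{
    if (__builtin_cpu_supports("avx2"))
        gather4_avx2(src, idx, dst);
    else
        gather4_scalar(src, idx, dst);
}
```

Both cases from the question use the same call and differ only in the index vector: `{0, 10, 20, 30}` for the evenly spaced case and `{1, 6, 22, 43}` for arbitrary positions.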