Why is the AVX-256 VMOVAPS Instruction only copying four single precision floats instead of 8?

Question

I am trying to familiarize myself with the 256-bit AVX instructions available on some of the newer Intel processors. I have already verified that my i7-4720HQ supports 256-bit AVX instructions. The problem I am having is that the VMOVAPS instruction, which should copy 8 single precision floating point values, is only copying 4.

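dot PROC
    VMOVAPS YMM1, ymmword ptr [RCX]             ; load 32 bytes (8 floats) from the first array
    VDPPS   YMM2, YMM1, ymmword ptr [RDX], 255  ; dot product against the second array, imm8 = 0FFh
    VMOVAPS ymmword ptr [RCX], YMM2             ; store the result back over the first array
    MOVSS   XMM0, DWORD PTR [RCX]               ; return the lowest float in XMM0
    RET
dot ENDP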

In case you aren't familiar with the calling convention: Visual C++ 2015 expects a float return value to be in XMM0 when the function returns.

In addition, the first argument is passed in RCX and the second argument in RDX.

Here is the C code that calls this function.

_declspec(align(32)) float d1[] = { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f };
_declspec(align(32)) float d2[] = { 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f };
printf("Dot Product Test: %f\n", dot(d1, d2));

The return value of the dot function is always 8.0. In addition to this, I have debugged the function and found that after the first assembly instruction, only four values get copied into YMM1. The rest of YMM1 remains zeroed.

Am I doing something wrong here? I've looked through the Intel documentation and some third party documentation. As far as I can tell I'm doing everything right. Am I using the wrong instruction? By the way, if you are here to tell me to use the Intel compiler intrinsics, don't bother.

Answer 1:

You forgot to read the instruction set reference page for VDPPS. It mentions that the result is produced in two halves:

VDPPS (VEX.256 encoded version)
DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] ← DP_Primitive(SRC1[255:128], SRC2[255:128]);

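It's not the VMOVAPS that's wrong. Each 128-bit half computes its own dot product over four elements, so with your data each half holds 4 × (1.0 × 2.0) = 8.0, and reading the low float returns 8.0 rather than 16.0.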
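To make the two-halves behavior concrete, here is a minimal sketch in plain C that models the pseudocode above for an immediate of 255 (all four products summed, the sum broadcast to all four slots of the lane). The names dp_lane_ff, vdpps256_model and dot8_model are made up for this illustration. It is a scalar model of the documented semantics, not the instruction itself, and the pairwise summation order is an assumption about how DP_Primitive adds its terms.

```c
#include <stddef.h>

/* Hypothetical model of DP_Primitive with imm8 = 0xFF for one 128-bit lane:
   multiply four float pairs, sum the products, and write the sum to all
   four destination slots. */
static void dp_lane_ff(const float *a, const float *b, float *dest)
{
    float p[4];
    for (size_t i = 0; i < 4; ++i)
        p[i] = a[i] * b[i];

    /* Pairwise summation (assumed order). */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    for (size_t i = 0; i < 4; ++i)
        dest[i] = sum;
}

/* Model of the VEX.256 form: the low and high lanes are processed
   independently, so nothing ever crosses the 128-bit boundary. */
void vdpps256_model(const float a[8], const float b[8], float dest[8])
{
    dp_lane_ff(a, b, dest);             /* DEST[127:0]   */
    dp_lane_ff(a + 4, b + 4, dest + 4); /* DEST[255:128] */
}

/* A full 8-element dot product needs one extra step: add the two
   per-lane results together. */
float dot8_model(const float a[8], const float b[8])
{
    float r[8];
    vdpps256_model(a, b, r);
    return r[0] + r[4];
}
```

With the d1 and d2 arrays from the question, the model leaves 8.0 in every slot of both lanes, and dot8_model returns 16.0. In the assembly version, the equivalent fix would be to bring the upper lane down (VEXTRACTF128 is one option) and add it to the lower lane before returning.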

Answer 2:

I just updated to Visual Studio 2015 Update 2, and now it is working properly. I have no idea why. My best guess is that MASM was converting my AVX256 code into AVX128 code for no good reason. Either way, problem solved.

Source: https://stackoverflow.com/questions/36798584/why-is-the-avx-256-vmovaps-instruction-only-copying-four-single-precision-floats

Tags: assembly, intel, avx, avx2