x86 MUL Instruction from VS 2008/2010

鱼传尺愫 2020-11-29 09:29

Will modern (2008/2010) incarnations of Visual Studio or Visual C++ Express produce x86 MUL instructions (unsigned multiply) in the compiled code? I cannot seem to find or

6 Answers
  •  温柔的废话
    2020-11-29 09:58

    Right after I looked at this question I found MULQ in my generated code when dividing.

    The full code turns a large binary number into chunks of a billion so that it can easily be converted to a string.

    C++ code:

    for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
        Remainder <<= 32;
        Remainder += Num;
        Num = Remainder / 1000000000;
        Remainder %= 1000000000;//equivalent to Remainder %= DecimalConvert
    });
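
For context, here is a minimal, self-contained sketch of how a pass like the one above could drive a full binary-to-decimal conversion. The original post does not show the surrounding code, so everything here is an assumption made for illustration: the function name `to_decimal`, the parameter name `limbs`, the choice of 32-bit words stored least significant first, and the zero padding of each 9-digit chunk.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): converts a big unsigned
// number, stored as 32-bit words with the least significant word first,
// into its decimal string.
std::string to_decimal(std::vector<std::uint32_t> limbs) {
    const std::uint64_t kBillion = 1000000000u;
    std::string digits;
    while (!limbs.empty()) {
        std::uint64_t remainder = 0;
        // One pass of long division by 10^9, most significant word first.
        // This mirrors the lambda body in the answer.
        for (auto it = limbs.rbegin(); it != limbs.rend(); ++it) {
            remainder = (remainder << 32) + *it;
            *it = static_cast<std::uint32_t>(remainder / kBillion);
            remainder %= kBillion;
        }
        // Drop words that became zero at the most significant end.
        while (!limbs.empty() && limbs.back() == 0) {
            limbs.pop_back();
        }
        // The remainder is the next chunk of up to nine decimal digits.
        std::string chunk = std::to_string(remainder);
        if (!limbs.empty()) {
            // Inner chunks need leading zeros; the final (leading) one does not.
            chunk = std::string(9 - chunk.size(), '0') + chunk;
        }
        digits = chunk + digits;
    }
    return digits.empty() ? std::string("0") : digits;
}
```

Each outer iteration divides the whole number by a billion once, so the division and modulo by the constant 1000000000 sit in the hot inner loop. That is why the compiler's treatment of them matters.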
    

    Optimized generated assembly:

    00007FF7715B18E8  lea         r9,[rsi-4]  
    00007FF7715B18EC  mov         r13,12E0BE826D694B2Fh  
    00007FF7715B18F6  nop         word ptr [rax+rax] 
    00007FF7715B1900  shl         r8,20h  
    00007FF7715B1904  mov         eax,dword ptr [r9]  
    00007FF7715B1907  add         r8,rax  
    00007FF7715B190A  mov         rax,r13  
    00007FF7715B190D  mul         rax,r8  
    00007FF7715B1910  mov         rcx,r8  
    00007FF7715B1913  sub         rcx,rdx  
    00007FF7715B1916  shr         rcx,1  
    00007FF7715B1919  add         rcx,rdx  
    00007FF7715B191C  shr         rcx,1Dh  
    00007FF7715B1920  imul        rax,rcx,3B9ACA00h  
    00007FF7715B1927  sub         r8,rax  
    00007FF7715B192A  mov         dword ptr [r9],ecx  
    00007FF7715B192D  lea         r9,[r9-4]  
    00007FF7715B1931  lea         rax,[r9+4]  
    00007FF7715B1935  cmp         rax,r14  
    00007FF7715B1938  jne         NumToString+0D0h (07FF7715B1900h)  
    

    Notice the MUL instruction (mul rax,r8) in the middle of the loop. This generated code is extremely unintuitive, I know; in fact it looks nothing like the source code. But DIV is extremely slow compared with MUL or IMUL: roughly 25 cycles for a 32-bit DIV and about 75 for a 64-bit one on modern PCs according to this chart, versus around 3 or 4 cycles for a multiply. So it makes sense to get rid of DIV even if that takes all sorts of extra instructions.

    I don't fully understand the optimization here, but if you would like a rational, mathematical explanation of how compile-time work and multiplication are used to divide by constants, see this paper.

    This is an example of the compiler making use of the performance and capability of the full, untruncated 64-by-64-bit multiply without showing the C++ coder any sign of it.
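
As a rough illustration, the multiply-and-shift sequence in the listing can be transcribed back into C++. The magic constant `12E0BE826D694B2Fh`, the shift counts, and the multiplier `3B9ACA00h` (1000000000) are copied from the disassembly. The function names `mul_high`, `div_by_billion`, and `mod_by_billion` are hypothetical, and the portable `mul_high` is only a stand-in for the upper half that MUL leaves in RDX (on x64 MSVC the `__umulh` intrinsic would presumably serve the same purpose).

```cpp
#include <cstdint>

// Upper 64 bits of the full 128-bit product a * b, built from 32-bit halves.
// Stand-in for what the MUL instruction leaves in RDX.
std::uint64_t mul_high(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    const std::uint64_t lo_lo = a_lo * b_lo;
    const std::uint64_t hi_lo = a_hi * b_lo;
    const std::uint64_t lo_hi = a_lo * b_hi;
    const std::uint64_t hi_hi = a_hi * b_hi;
    // Sum of the middle partial products plus the carry out of lo_lo;
    // this sum cannot overflow 64 bits.
    const std::uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

// n / 1000000000 without a DIV, following the listing step by step.
// Overall this computes floor(n * (2^64 + magic) / 2^94).
std::uint64_t div_by_billion(std::uint64_t n) {
    const std::uint64_t magic = 0x12E0BE826D694B2Full; // mov r13,12E0BE826D694B2Fh
    const std::uint64_t hi = mul_high(magic, n);       // mul rax,r8 (result in rdx)
    std::uint64_t q = n - hi;                          // sub rcx,rdx
    q >>= 1;                                           // shr rcx,1
    q += hi;                                           // add rcx,rdx
    q >>= 29;                                          // shr rcx,1Dh
    return q;
}

// n % 1000000000, recovered from the quotient with one multiply and one subtract.
std::uint64_t mod_by_billion(std::uint64_t n) {
    return n - div_by_billion(n) * 1000000000ull;      // imul rax,rcx,3B9ACA00h / sub r8,rax
}
```

The subtract, halve, and add steps exist because the effective multiplier 2^64 + magic does not fit in 64 bits; splitting it this way keeps every intermediate value inside a 64-bit register.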
