memcpy performance differences between 32 and 64 bit processes

梦如初夏 2020-12-14 23:43

We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy

7 answers
  •  轻奢々 (original poster)
    2020-12-15 00:13

    I finally got to the bottom of this (and Die in Sente's answer was on the right lines, thanks)

    In what follows, dst and src are each a 512 MByte std::vector. I'm using the Intel 10.1.029 compiler and CRT.

    On 64bit both

    memcpy(&dst[0],&src[0],dst.size())

    and

    memcpy(&dst[0],&src[0],N)

    where N is previously declared const size_t N=512*(1<<20); call

    __intel_fast_memcpy

    the bulk of which consists of:

      000000014004ED80  lea         rcx,[rcx+40h] 
      000000014004ED84  lea         rdx,[rdx+40h] 
      000000014004ED88  lea         r8,[r8-40h] 
      000000014004ED8C  prefetchnta [rdx+180h] 
      000000014004ED93  movdqu      xmm0,xmmword ptr [rdx-40h] 
      000000014004ED98  movdqu      xmm1,xmmword ptr [rdx-30h] 
      000000014004ED9D  cmp         r8,40h 
      000000014004EDA1  movntdq     xmmword ptr [rcx-40h],xmm0 
      000000014004EDA6  movntdq     xmmword ptr [rcx-30h],xmm1 
      000000014004EDAB  movdqu      xmm2,xmmword ptr [rdx-20h] 
      000000014004EDB0  movdqu      xmm3,xmmword ptr [rdx-10h] 
      000000014004EDB5  movntdq     xmmword ptr [rcx-20h],xmm2 
      000000014004EDBA  movntdq     xmmword ptr [rcx-10h],xmm3 
      000000014004EDBF  jge         000000014004ED80 
    

    and runs at ~2200 MByte/s.

    But on 32bit

    memcpy(&dst[0],&src[0],dst.size())

    calls

    __intel_fast_memcpy

    the bulk of which consists of

      004447A0  sub         ecx,80h 
      004447A6  movdqa      xmm0,xmmword ptr [esi] 
      004447AA  movdqa      xmm1,xmmword ptr [esi+10h] 
      004447AF  movdqa      xmmword ptr [edx],xmm0 
      004447B3  movdqa      xmmword ptr [edx+10h],xmm1 
      004447B8  movdqa      xmm2,xmmword ptr [esi+20h] 
      004447BD  movdqa      xmm3,xmmword ptr [esi+30h] 
      004447C2  movdqa      xmmword ptr [edx+20h],xmm2 
      004447C7  movdqa      xmmword ptr [edx+30h],xmm3 
      004447CC  movdqa      xmm4,xmmword ptr [esi+40h] 
      004447D1  movdqa      xmm5,xmmword ptr [esi+50h] 
      004447D6  movdqa      xmmword ptr [edx+40h],xmm4 
      004447DB  movdqa      xmmword ptr [edx+50h],xmm5 
      004447E0  movdqa      xmm6,xmmword ptr [esi+60h] 
      004447E5  movdqa      xmm7,xmmword ptr [esi+70h] 
      004447EA  add         esi,80h 
      004447F0  movdqa      xmmword ptr [edx+60h],xmm6 
      004447F5  movdqa      xmmword ptr [edx+70h],xmm7 
      004447FA  add         edx,80h 
      00444800  cmp         ecx,80h 
      00444806  jge         004447A0
    

    and runs at ~1350 MByte/s only.

    HOWEVER

    memcpy(&dst[0],&src[0],N)
    

    where N is previously declared const size_t N=512*(1<<20); compiles (on 32bit) to a direct call to a

    __intel_VEC_memcpy
    

    the bulk of which consists of

      0043FF40  movdqa      xmm0,xmmword ptr [esi] 
      0043FF44  movdqa      xmm1,xmmword ptr [esi+10h] 
      0043FF49  movdqa      xmm2,xmmword ptr [esi+20h] 
      0043FF4E  movdqa      xmm3,xmmword ptr [esi+30h] 
      0043FF53  movntdq     xmmword ptr [edi],xmm0 
      0043FF57  movntdq     xmmword ptr [edi+10h],xmm1 
      0043FF5C  movntdq     xmmword ptr [edi+20h],xmm2 
      0043FF61  movntdq     xmmword ptr [edi+30h],xmm3 
      0043FF66  movdqa      xmm4,xmmword ptr [esi+40h] 
      0043FF6B  movdqa      xmm5,xmmword ptr [esi+50h] 
      0043FF70  movdqa      xmm6,xmmword ptr [esi+60h] 
      0043FF75  movdqa      xmm7,xmmword ptr [esi+70h] 
      0043FF7A  movntdq     xmmword ptr [edi+40h],xmm4 
      0043FF7F  movntdq     xmmword ptr [edi+50h],xmm5 
      0043FF84  movntdq     xmmword ptr [edi+60h],xmm6 
      0043FF89  movntdq     xmmword ptr [edi+70h],xmm7 
      0043FF8E  lea         esi,[esi+80h] 
      0043FF94  lea         edi,[edi+80h] 
      0043FF9A  dec         ecx  
      0043FF9B  jne         ___intel_VEC_memcpy+244h (43FF40h) 
    

    and runs at ~2100 MByte/s (which proves that 32bit isn't somehow bandwidth limited).
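    For anyone wanting to reproduce figures like these, a timing harness along the following lines should do. This is a minimal sketch, not the code used for the measurements above: the function name copy_rate_mbytes_per_sec is made up for illustration, and it assumes a compiler with std::chrono (C++11), which is newer than the toolchain discussed here.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original post): copies src into dst
// `reps` times with memcpy and returns the observed rate in MByte/s.
// Returns 0.0 if the arguments are unusable or the clock shows no elapsed time.
double copy_rate_mbytes_per_sec(std::vector<char>& dst,
                                const std::vector<char>& src,
                                int reps) {
    if (src.empty() || dst.size() < src.size() || reps <= 0) return 0.0;

    // Warm-up copy so that page faults on first touch are not timed.
    std::memcpy(&dst[0], &src[0], src.size());

    volatile char sink = 0;  // read back one byte per pass so the copies stay observable
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        std::memcpy(&dst[0], &src[0], src.size());
        sink = dst[0];
    }
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) return 0.0;

    const double mbytes =
        static_cast<double>(src.size()) * reps / (1024.0 * 1024.0);
    return mbytes / seconds;
}
```

    With two 512 MByte vectors and a handful of repetitions the buffers are far larger than any cache, so the number reflects memory throughput rather than cache throughput. An aggressive optimizer may still simplify repeated copies, so it is worth checking the disassembly as was done above.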

    I withdraw my claim that my own memcpy-like SSE code suffers from a similar ~1300 MByte/s limit in 32bit builds. I now have no problem getting >2 GByte/s on 32 or 64bit; the trick (as the above results hint) is to use non-temporal ("streaming") stores (e.g. the _mm_stream_ps intrinsic).
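    To make the streaming-store idea concrete, here is a rough sketch of a copy loop built on non-temporal stores. It is not the Intel CRT routine and not my original code: stream_copy is a hypothetical name, it uses the integer intrinsic _mm_stream_si128 (which emits movntdq, as in the disassembly above) rather than _mm_stream_ps, and it falls back to plain memcpy when SSE2 is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#define HAVE_SSE2_STREAM 1
#else
#define HAVE_SSE2_STREAM 0
#endif

// Hypothetical helper: copies n bytes from src to dst (non-overlapping),
// using non-temporal 16-byte stores for the bulk of the data.
void stream_copy(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

#if HAVE_SSE2_STREAM
    // movntdq needs a 16-byte aligned destination, so copy a short head first.
    std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>(d) & 15)) & 15;
    if (head > n) head = n;
    std::memcpy(d, s, head);
    d += head; s += head; n -= head;
    assert(n == 0 || (reinterpret_cast<std::uintptr_t>(d) & 15) == 0);

    // Bulk: 64 bytes per iteration, unaligned loads and streaming stores.
    for (; n >= 64; n -= 64, d += 64, s += 64) {
        const __m128i* in = reinterpret_cast<const __m128i*>(s);
        __m128i* out = reinterpret_cast<__m128i*>(d);
        __m128i a = _mm_loadu_si128(in + 0);
        __m128i b = _mm_loadu_si128(in + 1);
        __m128i c = _mm_loadu_si128(in + 2);
        __m128i e = _mm_loadu_si128(in + 3);
        _mm_stream_si128(out + 0, a);
        _mm_stream_si128(out + 1, b);
        _mm_stream_si128(out + 2, c);
        _mm_stream_si128(out + 3, e);
    }
    _mm_sfence();  // make the streaming stores globally visible before returning
#endif

    // Tail (or the whole buffer when SSE2 is unavailable).
    std::memcpy(d, s, n);
}
```

    Streaming stores bypass the cache on the way out, which is what helps for buffers much larger than the cache; for small copies that are read again soon afterwards, an ordinary cached copy is likely the better choice.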

    It seems a bit strange that the 32bit "dst.size()" memcpy doesn't eventually call the faster "movnt" version. If you step into memcpy, there is an incredible amount of CPUID checking and heuristic logic (e.g. comparing the number of bytes to be copied with the cache size) before it goes anywhere near your actual data. But at least I understand the observed behaviour now, and it's not SysWow64 or H/W related.
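    As a purely illustrative sketch of the kind of size-versus-cache heuristic mentioned above: the names CopyPath and choose_copy_path, and the rule "stream when the copy is at least as large as the cache", are assumptions of mine. This is not Intel's actual dispatch logic, which I have not reverse engineered in detail.

```cpp
#include <cstddef>

// Hypothetical dispatch result: which kind of store loop to use.
enum class CopyPath { Cached, Streaming };

// Assumed rule for illustration only: a copy at least as large as the
// cache would evict everything anyway, so prefer non-temporal stores.
CopyPath choose_copy_path(std::size_t bytes, std::size_t cache_bytes) {
    return bytes >= cache_bytes ? CopyPath::Streaming : CopyPath::Cached;
}
```

    Under this assumed rule a 512 MByte copy would pick the streaming path for any realistic cache size, which is the outcome one would have expected from the 32bit "dst.size()" call as well.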
