Poor memcpy Performance on Linux

后端 未结 7 2014
执笔经年
执笔经年 2021-01-29 21:04

We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.

7条回答
  •  独厮守ぢ
    2021-01-29 21:28

    Server 1 Specs

    • CPU: 2x Intel Xeon E5-2680 @ 2.70 Ghz

    Server 2 Specs

    • CPU: 2x Intel Xeon E5-2650 v2 @ 2.6 Ghz

    According to Intel ARK, both the E5-2650 and E5-2680 have AVX extension.

    CMake File to Build

    This is part of your problem. CMake chooses some rather poor flags for you. You can confirm it by running make VERBOSE=1.

    You should add both -march=native and -O3 to your CFLAGS and CXXFLAGS. You will likely see a dramatic performance increase. It should engage the AVX extensions. Without -march=XXX, you effectively get a minimal i686 or x86_64 machine. Without -O3, you don't engage GCC's vectorizations.

    I'm not sure if GCC 4.6 is capable of AVX (and friends, like BMI). I know GCC 4.8 or 4.9 is capable because I had to hunt down an alignment bug that was causing a segfault when GCC was outsourcing memcpy's and memset's to the MMX unit. AVX and AVX2 allow the CPU to operate on 16-byte and 32-byte blocks of data at a time.

    If GCC is missing an opportunity to send aligned data to the MMX unit, it may be missing the fact that data is aligned. If your data is 16-byte aligned, then you might try telling GCC so it knows to operate on fat blocks. For that, see GCC's __builtin_assume_aligned. Also see questions like How to tell GCC that a pointer argument is always double-word-aligned?

    This also looks a little suspect because of the void*. Its kind of throwing away information about the pointer. You should probably keep the information:

    void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
    {
      memmove(pDest, pSource, sizeBytes);
    }
    

    Maybe something like the following:

    template 
    void doMemmove(T* pDest, const T* pSource, std::size_t count)
    {
      memmove(pDest, pSource, count*sizeof(T));
    }
    

    Another suggestion is to use new, and stop using malloc. Its a C++ program and GCC can make some assumptions about new that it cannot make about malloc. I believe some of the assumptions are detailed in GCC's option page for the built-ins.

    Yet another suggestion is to use the heap. Its always 16-byte aligned on typical modern systems. GCC should recognize it can offload to the MMX unit when a pointer from the heap is involved (sans the potential void* and malloc issues).

    Finally, for a while, Clang was not using the native CPU extensions when using -march=native. See, for example, Ubuntu Issue 1616723, Clang 3.4 only advertises SSE2, Ubuntu Issue 1616723, Clang 3.5 only advertises SSE2, and Ubuntu Issue 1616723, Clang 3.6 only advertises SSE2.

提交回复
热议问题