We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.
Server 1 Specs
- CPU: 2x Intel Xeon E5-2680 @ 2.70 GHz
Server 2 Specs
- CPU: 2x Intel Xeon E5-2650 v2 @ 2.60 GHz
According to Intel ARK, both the E5-2650 v2 and the E5-2680 support the AVX extensions.
CMake File to Build
This is part of your problem. CMake chooses some rather poor flags for you. You can confirm it by running make VERBOSE=1.

You should add both -march=native and -O3 to your CFLAGS and CXXFLAGS. You will likely see a dramatic performance increase; it should engage the AVX extensions. Without -march=XXX, you effectively get a minimal i686 or x86_64 machine. Without -O3, you don't engage GCC's vectorizations.
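For example, a minimal sketch of the change in the CMakeLists.txt, assuming it uses the standard CMAKE_C_FLAGS/CMAKE_CXX_FLAGS variables and a GCC-compatible compiler (adjust to your actual build file):

# Append optimization and ISA flags so GCC can vectorize the copies
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 -march=native")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -march=native")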
I'm not sure whether GCC 4.6 is capable of AVX (and friends, like BMI). I know GCC 4.8 and 4.9 are capable because I had to hunt down an alignment bug that was causing a segfault when GCC was outsourcing memcpys and memsets to the SIMD unit. SSE and AVX allow the CPU to operate on 16-byte and 32-byte blocks of data at a time, respectively.
If GCC is missing an opportunity to send aligned data to the SIMD unit, it may simply not know that the data is aligned. If your data is 16-byte aligned, then try telling GCC so it knows to operate on fat blocks. For that, see GCC's __builtin_assume_aligned. Also see questions like How to tell GCC that a pointer argument is always double-word-aligned?
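A minimal sketch of what that hint might look like, assuming both buffers really are 16-byte aligned (the function name here is made up for illustration):

#include <cstring>
#include <cstddef>

// Promise GCC that both pointers are 16-byte aligned so it can emit
// aligned SSE/AVX loads and stores for the copy.
void doAlignedMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
{
    void* d = __builtin_assume_aligned(pDest, 16);
    const void* s = __builtin_assume_aligned(pSource, 16);
    std::memmove(d, s, sizeBytes);
}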
This also looks a little suspect because of the void*. It's kind of throwing away information about the pointer:
void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
{
memmove(pDest, pSource, sizeBytes);
}
You should probably keep the information. Maybe something like the following:
template <typename T>
void doMemmove(T* pDest, const T* pSource, std::size_t count)
{
memmove(pDest, pSource, count*sizeof(T));
}
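A short usage sketch (the buffers and sizes are made up for illustration): because the element type is known, count is in elements rather than bytes, and GCC can see the natural alignment of T.

#include <cstring>
#include <cstddef>

template <typename T>
void doMemmove(T* pDest, const T* pSource, std::size_t count)
{
    std::memmove(pDest, pSource, count * sizeof(T));
}

int main()
{
    double src[256] = {};        // hypothetical source buffer
    double dst[256];
    doMemmove(dst, src, 256);    // 256 doubles, not 256 bytes
    return 0;
}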
Another suggestion is to use new, and stop using malloc. It's a C++ program, and GCC can make some assumptions about new that it cannot make about malloc. I believe some of the assumptions are detailed in GCC's option page for the built-ins.
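A sketch of that change, with a hypothetical buffer size; the point is that new[] keeps the element type rather than returning a void* that has to be cast:

#include <cstddef>

int main()
{
    const std::size_t count = 1024;   // hypothetical buffer size

    // Instead of: float* p = static_cast<float*>(malloc(count * sizeof(float)));
    float* pBuffer = new float[count];

    // ... fill and copy pBuffer ...

    delete[] pBuffer;
    return 0;
}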
Yet another suggestion is to use the heap. It's always 16-byte aligned on typical modern systems. GCC should recognize that it can offload to the SIMD unit when a pointer from the heap is involved (sans the potential void* and malloc issues).
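If you want to verify that alignment assumption on the servers, a quick diagnostic sketch (purely illustrative; 16 is the alignment being tested):

#include <cstdint>
#include <cstdio>

int main()
{
    double* p = new double[1024];   // heap allocation, as suggested above
    std::printf("16-byte aligned: %s\n",
                (reinterpret_cast<std::uintptr_t>(p) % 16 == 0) ? "yes" : "no");
    delete[] p;
    return 0;
}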
Finally, for a while, Clang was not using the native CPU extensions when using -march=native. See, for example, Ubuntu Issue 1616723, Clang 3.4 only advertises SSE2; Ubuntu Issue 1616723, Clang 3.5 only advertises SSE2; and Ubuntu Issue 1616723, Clang 3.6 only advertises SSE2.