We have recently purchased some new servers and are experiencing poor memcpy performance. The memcpy performance is 3x slower on the servers compared to our laptops.
Server 1 Specs
- CPU: 2x Intel Xeon E5-2680 @ 2.70 GHz
Server 2 Specs
- CPU: 2x Intel Xeon E5-2650 v2 @ 2.60 GHz
According to Intel ARK, both the E5-2650 v2 and the E5-2680 support the AVX extensions.
CMake File to Build
This is part of your problem. CMake chooses some rather poor flags for you. You can confirm it by running make VERBOSE=1.

You should add both -march=native and -O3 to your CFLAGS and CXXFLAGS. You will likely see a dramatic performance increase; it should engage the AVX extensions. Without -march=XXX, you effectively get a minimal i686 or x86_64 machine. Without -O3, you don't engage GCC's vectorizations.
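For example, a minimal sketch of the change in the CMakeLists.txt, assuming it uses the standard CMAKE_C_FLAGS/CMAKE_CXX_FLAGS variables and a GCC-compatible compiler (adjust to your actual build file):

# Append optimization and ISA flags so GCC can vectorize the copies
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 -march=native")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -march=native")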
I'm not sure whether GCC 4.6 is capable of AVX (and friends, like BMI). I know GCC 4.8 and 4.9 are capable because I had to hunt down an alignment bug that was causing a segfault when GCC was outsourcing memcpys and memsets to the SIMD unit. SSE and AVX allow the CPU to operate on 16-byte and 32-byte blocks of data at a time, respectively.
If GCC is missing an opportunity to send aligned data to the SIMD unit, it may simply not know that the data is aligned. If your data is 16-byte aligned, then try telling GCC so it knows to operate on fat blocks. For that, see GCC's __builtin_assume_aligned. Also see questions like How to tell GCC that a pointer argument is always double-word-aligned?
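A minimal sketch of what that hint might look like, assuming both buffers really are 16-byte aligned (the function name here is made up for illustration):

#include <cstring>
#include <cstddef>

// Promise GCC that both pointers are 16-byte aligned so it can emit
// aligned SSE/AVX loads and stores for the copy.
void doAlignedMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
{
    void* d = __builtin_assume_aligned(pDest, 16);
    const void* s = __builtin_assume_aligned(pSource, 16);
    std::memmove(d, s, sizeBytes);
}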
This also looks a little suspect because of the void*. It's kind of throwing away information about the pointer:
void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
{
memmove(pDest, pSource, sizeBytes);
}
You should probably keep the information. Maybe something like the following:
template <typename T>
void doMemmove(T* pDest, const T* pSource, std::size_t count)
{
memmove(pDest, pSource, count*sizeof(T));
}
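A short usage sketch (the buffers and sizes are made up for illustration): because the element type is known, count is in elements rather than bytes, and GCC can see the natural alignment of T.

#include <cstring>
#include <cstddef>

template <typename T>
void doMemmove(T* pDest, const T* pSource, std::size_t count)
{
    std::memmove(pDest, pSource, count * sizeof(T));
}

int main()
{
    double src[256] = {};        // hypothetical source buffer
    double dst[256];
    doMemmove(dst, src, 256);    // 256 doubles, not 256 bytes
    return 0;
}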
Another suggestion is to use new, and stop using malloc. It's a C++ program, and GCC can make some assumptions about new that it cannot make about malloc. I believe some of the assumptions are detailed in GCC's option page for the built-ins.
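A sketch of that change, with a hypothetical buffer size; the point is that new[] keeps the element type rather than returning a void* that has to be cast:

#include <cstddef>

int main()
{
    const std::size_t count = 1024;   // hypothetical buffer size

    // Instead of: float* p = static_cast<float*>(malloc(count * sizeof(float)));
    float* pBuffer = new float[count];

    // ... fill and copy pBuffer ...

    delete[] pBuffer;
    return 0;
}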
Yet another suggestion is to use the heap. It's always 16-byte aligned on typical modern systems. GCC should recognize that it can offload to the SIMD unit when a pointer from the heap is involved (sans the potential void* and malloc issues).
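If you want to verify that alignment assumption on the servers, a quick diagnostic sketch (purely illustrative; 16 is the alignment being tested):

#include <cstdint>
#include <cstdio>

int main()
{
    double* p = new double[1024];   // heap allocation, as suggested above
    std::printf("16-byte aligned: %s\n",
                (reinterpret_cast<std::uintptr_t>(p) % 16 == 0) ? "yes" : "no");
    delete[] p;
    return 0;
}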
Finally, for a while, Clang was not using the native CPU extensions when using -march=native. See, for example, Ubuntu Issue 1616723, Clang 3.4 only advertises SSE2; Ubuntu Issue 1616723, Clang 3.5 only advertises SSE2; and Ubuntu Issue 1616723, Clang 3.6 only advertises SSE2.