Memcpy vs Memmove - Debug vs Release

Anonymous (unverified), submitted 2019-12-03 02:38:01

Question:

I'm seeing really strange behavior in my x64 multithreaded application: the execution time in debug mode is faster than in release mode.

I broke the problem down and found the issue: the debug build (note: optimization is off!) turns the memcpy into a memmove, which performs faster. The release build still uses memcpy (note: optimization is on).

This slows down my multithreaded app in release mode. :(

Does anyone have any idea?

#include <time.h>
#include <stdio.h>
#include <string.h>

#define T_SIZE 1024*1024*2

int main() {
    clock_t start;

    // static: ~202 MB is far too large for the stack
    static char data[T_SIZE];
    static char store[100][T_SIZE];

    start = clock();
    for (int i = 0; i < 4000; i++) {
        memcpy(store[i % 100], data, T_SIZE);
    }
    // Debug > Release Time 1040 < 1620
    printf("memcpy: %ld\n", (long)(clock() - start));

    start = clock();
    for (int i = 0; i < 4000; i++) {
        memmove(store[i % 100], data, T_SIZE);
    }
    // Debug > Release Time 1040 > 923
    printf("memmove: %ld\n", (long)(clock() - start));
}

Answer 1:

The following answer is valid for VS2013 ONLY

What we have here is actually stranger than just memcpy vs. memmove. It's a case of the intrinsic optimization actually slowing things down. The issue stems from the fact that VS2013 inlines memcpy like this:

; 73   :        memcpy(store[i % 100], data, sizeof(data));

    mov     eax, 1374389535             ; 51eb851fH
    mul     esi
    shr     edx, 5
    imul    eax, edx, 100               ; 00000064H
    mov     ecx, esi
    sub     ecx, eax
    movsxd  rcx, ecx
    shl     rcx, 21
    add     rcx, r14
    mov     rdx, r13
    mov     r8d, 16384                  ; 00004000H
    npad    12
$LL413@wmain:
    movups  xmm0, XMMWORD PTR [rdx]
    movups  XMMWORD PTR [rcx], xmm0
    movups  xmm1, XMMWORD PTR [rdx+16]
    movups  XMMWORD PTR [rcx+16], xmm1
    movups  xmm0, XMMWORD PTR [rdx+32]
    movups  XMMWORD PTR [rcx+32], xmm0
    movups  xmm1, XMMWORD PTR [rdx+48]
    movups  XMMWORD PTR [rcx+48], xmm1
    movups  xmm0, XMMWORD PTR [rdx+64]
    movups  XMMWORD PTR [rcx+64], xmm0
    movups  xmm1, XMMWORD PTR [rdx+80]
    movups  XMMWORD PTR [rcx+80], xmm1
    movups  xmm0, XMMWORD PTR [rdx+96]
    movups  XMMWORD PTR [rcx+96], xmm0
    lea     rcx, QWORD PTR [rcx+128]
    movups  xmm1, XMMWORD PTR [rdx+112]
    movups  XMMWORD PTR [rcx-16], xmm1
    lea     rdx, QWORD PTR [rdx+128]
    dec     r8
    jne     SHORT $LL413@wmain

The issue with this is that we're doing unaligned SSE loads and stores, which is actually slower than just using standard C code. I verified this by grabbing the CRT's implementation from the source code included with Visual Studio and making a my_memcpy (shown in full below).

To ensure the cache was warm during all of this, I preinitialized all of data, and the results were telling:

Warm up took 43ms
my_memcpy took 862ms
memmove took 676ms
memcpy took 1329ms

So why is memmove faster? Because it doesn't get the intrinsic "optimization": it must assume the source and destination can overlap, so the compiler leaves it as a plain call into the CRT.
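For intuition, here is a minimal sketch (illustrative only, not the CRT's actual code) of why overlap forces memmove to pick a copy direction at run time:

#include <cstddef>

// Illustrative only: a typical memmove must choose its copy direction
// so that overlapping bytes are read before they are overwritten.
void* sketch_memmove(void* dst, const void* src, std::size_t count)
{
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);

    if (d < s) {
        while (count--)          // dst below src: copy forward
            *d++ = *s++;
    } else {
        d += count;              // dst may overlap the tail of src:
        s += count;              // copy backward
        while (count--)
            *--d = *--s;
    }
    return dst;
}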

For those curious, this is my code in full:

#include <cstdlib>
#include <cstring>
#include <chrono>
#include <iostream>
#include <random>
#include <functional>
#include <limits>

namespace {
    const auto t_size = 1024ULL * 1024ULL * 2ULL;
    __declspec(align(16)) char data[t_size];
    __declspec(align(16)) char store[100][t_size];

    void * __cdecl my_memcpy(
        void * dst,
        const void * src,
        size_t count
        )
    {
        void * ret = dst;

        /*
         * copy from lower addresses to higher addresses
         */
        while (count--) {
            *(char *)dst = *(char *)src;
            dst = (char *)dst + 1;
            src = (char *)src + 1;
        }

        return(ret);
    }
}

int wmain(int argc, wchar_t* argv[])
{
    using namespace std::chrono;

    std::mt19937 rd{ std::random_device()() };
    std::uniform_int_distribution<short> dist(std::numeric_limits<char>::min(), std::numeric_limits<char>::max());
    auto random = std::bind(dist, rd);

    auto start = steady_clock::now();
    // warms up the cache and initializes
    for (int i = 0; i < t_size; ++i)
        data[i] = static_cast<char>(random());

    auto stop = steady_clock::now();
    std::cout << "Warm up took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";

    start = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        my_memcpy(store[i % 100], data, sizeof(data));

    stop = steady_clock::now();
    std::cout << "my_memcpy took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";

    start = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        memmove(store[i % 100], data, sizeof(data));

    stop = steady_clock::now();
    std::cout << "memmove took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";

    start = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        memcpy(store[i % 100], data, sizeof(data));

    stop = steady_clock::now();
    std::cout << "memcpy took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";

    std::cin.ignore();
    return 0;
}

Update

While debugging, I found that the compiler detected that the code I copied from the CRT is memcpy, but it links it to the non-intrinsic version in the CRT itself, which uses rep movs instead of the massive SSE loop above. It seems the issue is ONLY with the intrinsic version.
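If you want to rule out the intrinsic yourself, MSVC's #pragma function (the inverse of #pragma intrinsic) forces a real call instead of the inline expansion. A minimal sketch; copy_block is just an illustrative wrapper, not part of the original code:

#include <cstring>

// Disable intrinsic expansion of memcpy in this translation unit; calls
// below become real calls into the CRT's out-of-line implementation.
#pragma function(memcpy)

void copy_block(void* dst, const void* src, std::size_t n)
{
    memcpy(dst, src, n);  // emitted as a call, not the inlined SSE loop
}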

Update 2

Per Z boson in the comments, it seems that this is all very architecture-dependent. On my CPU rep movsb is faster, but on older CPUs the SSE or AVX implementation has the potential to be faster; per the Intel Optimization Manual, rep movsb can experience up to a 25% penalty for unaligned data on older hardware. That said, it appears that for the vast majority of cases and architectures, rep movsb will on average beat the SSE or AVX implementation.
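For anyone who wants to measure rep movsb directly on their own hardware, MSVC exposes it as the __movsb intrinsic in <intrin.h>. A minimal sketch (rep_movsb_copy is just an illustrative name):

#include <intrin.h>
#include <cstddef>

// Copy n bytes with a single rep movsb. Whether this beats the SSE/AVX
// loop depends on the CPU (e.g. ERMSB support on newer Intel parts).
void rep_movsb_copy(void* dst, const void* src, std::size_t n)
{
    __movsb(static_cast<unsigned char*>(dst),
            static_cast<const unsigned char*>(src),
            n);
}

Dropping this in place of my_memcpy in the benchmark above would let you compare it against the intrinsic and CRT versions on your own machine.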



Answer 2:

Idea: call memmove, since it's fastest for your case.


