Make compiler copy characters using movsd

若如初见. Submitted on 2019-12-04 05:26:36

Several questions come to mind.

First, how do you know movsd would be faster? Have you looked up its latency/throughput? The x86 architecture is full of crufty old instructions that should not be used because they're just not very efficient on modern CPUs.

Second, what happens if you use std::copy instead of memcpy? std::copy is potentially faster, as it can be specialized at compile-time for the specific data type.
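To make the comparison concrete, here is a minimal sketch of the same copy written both ways. The helper names CopyInts and CopyIntsMemcpy are made up for illustration, and whether std::copy actually generates better code depends on your compiler and settings, so check the generated assembly rather than trusting this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: typed copy. The compiler sees the element type (int)
// and may pick a copy strategy suited to it.
inline void CopyInts(const int *src, int *dst, std::size_t count)
{
  assert(count == 0 || (src != nullptr && dst != nullptr));
  std::copy(src, src + count, dst);
}

// Hypothetical helper: the same copy expressed as an untyped byte copy.
inline void CopyIntsMemcpy(const int *src, int *dst, std::size_t count)
{
  assert(count == 0 || (src != nullptr && dst != nullptr));
  std::memcpy(dst, src, count * sizeof(int));
}
```

Both functions should leave dst with identical contents; the interesting part is comparing the code the compiler emits for each.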

And third, have you enabled intrinsic functions under project properties -> C/C++ -> Optimization?

Of course I assume other optimizations are enabled as well.

Are you running an optimised build? It won't use an intrinsic unless optimisation is on. It's also worth noting that it will probably use a better copy loop than rep movsd. It should try to use MMX, at the least, to copy 64 bits at a time. In fact, 6 or 7 years back I wrote an MMX-optimised copy loop for doing this sort of thing. Unfortunately the compiler's intrinsic memcpy outperformed my MMX copy by about 1%. That really taught me not to make assumptions about what the compiler is doing.

Using memcpy with a constant size

What I have found meanwhile:

The compiler will use the intrinsic when the size of the copied block is known at compile time. When it is not, it calls the library implementation. When the size is known, the generated code is very nice and is selected based on the size. It may be a single mov, or movsd, or movsd followed by movsb, as needed.

It seems that if I really want to always use movsb or movsd, even with a "dynamic" size, I will have to use inline assembly or a special intrinsic (see below). I know the size is "quite short", but the compiler does not know it and I cannot communicate this to it. I have even tried __assume(size<16), but it is not enough.

Demo code, compile with -Ob1 (expansion for inline only):

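  #include <string.h>  // memcpy, size_t

  // Size is only known at run time: the compiler calls the library memcpy.
  void MemCpyTest(void *tgt, const void *src, size_t size)
  {
    memcpy(tgt,src,size);
  }

  // Size is a compile-time constant: the compiler can expand memcpy inline.
  template <int size>
  void MemCpyTestT(void *tgt, const void *src)
  {
    memcpy(tgt,src,size);
  }

  int main ( int argc, char **argv )
  {
    int src = 0;  // initialized so the copy reads a defined value
    int dst;
    MemCpyTest(&dst,&src,sizeof(dst));
    MemCpyTestT<sizeof(dst)>(&dst,&src);
    return 0;
  }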

Specialized intrinsics

I recently found that there is a very simple and natural way to make the Visual Studio compiler copy characters using movsd: intrinsics. The string-move intrinsics declared in <intrin.h>, such as __movsb and __movsd, may come in handy.
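A rough sketch of how the MSVC string-move intrinsics could be used for a copy with a run-time size. The helper name CopyWithMovs and the HAVE_MOVS_INTRINSICS macro are made up for illustration. __movsd and __movsb come from <intrin.h> on Visual Studio x86/x64 builds; on any other compiler this sketch simply falls back to memcpy, so it only demonstrates the intrinsics when built with MSVC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_MOVS_INTRINSICS 1  // hypothetical macro, local to this sketch
#endif

// Hypothetical helper: copy `bytes` bytes, 4 at a time with rep movsd and
// the remaining 0..3 bytes with rep movsb (MSVC only, memcpy elsewhere).
inline void CopyWithMovs(void *dst, const void *src, std::size_t bytes)
{
  assert(bytes == 0 || (dst != nullptr && src != nullptr));
#ifdef HAVE_MOVS_INTRINSICS
  unsigned char *d = static_cast<unsigned char *>(dst);
  const unsigned char *s = static_cast<const unsigned char *>(src);
  const std::size_t dwords = bytes / 4;
  // Whole 32-bit words: emitted as rep movsd.
  __movsd(reinterpret_cast<unsigned long *>(d),
          reinterpret_cast<const unsigned long *>(s), dwords);
  // Tail bytes: emitted as rep movsb.
  __movsb(d + dwords * 4, s + dwords * 4, bytes % 4);
#else
  std::memcpy(dst, src, bytes);  // assumption: no MSVC intrinsics available
#endif
}
```

This mirrors the "movsd followed by movsb" pattern the compiler generates by itself for constant sizes. Whether it beats the library memcpy is something to measure, not assume.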

Have you timed memcpy? On recent versions of Visual Studio, the memcpy implementation uses SSE2... which should be faster than rep movsd. If the block you're copying is 1 KB, then it's not really a problem that the compiler isn't using an intrinsic since the time for the function call will be negligible compared to the time for the copy.
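If you have not timed it yet, something along these lines gives a first impression. This is only a crude sketch: TimeMemcpyNs is a made-up helper, and a serious benchmark would also need warm-up runs, several repetitions and a look at the generated code to be sure the copy was not optimized away.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: average wall-clock time of one memcpy of `bytes`
// bytes, in nanoseconds, measured over `iterations` copies.
inline double TimeMemcpyNs(std::size_t bytes, int iterations)
{
  assert(bytes > 0 && iterations > 0);
  std::vector<char> src(bytes, 'x');
  std::vector<char> dst(bytes, 0);
  volatile char sink = 0;  // reading the result discourages dead-copy elimination

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
  {
    std::memcpy(dst.data(), src.data(), bytes);
    sink = dst[static_cast<std::size_t>(i) % bytes];
  }
  const auto stop = std::chrono::steady_clock::now();
  (void)sink;

  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}
```

Calling TimeMemcpyNs(1024, 1000000) for the 1 KB case, and then swapping the memcpy line for your own copy routine, lets you compare the two on your machine.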

Note that in order to use movsd alone, the length must be a multiple of 4 bytes, and src should point to memory aligned to a 32-bit boundary (x86 tolerates unaligned string moves, but they are typically slower).

If it is, why does your code use char * instead of int * or something? If it's not, your question is moot.

If you change char * to int *, you might get a better result from std::copy.
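The alignment and length conditions discussed above can be checked at run time. A small sketch, with made-up helper names (IsDwordCopyable, CopyDwords), assuming you want to assert the precondition in debug builds before treating a buffer as a sequence of 32-bit words:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when both pointers sit on a 4-byte boundary
// and the length is a whole number of 32-bit words.
inline bool IsDwordCopyable(const void *src, const void *dst, std::size_t bytes)
{
  const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
  const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
  return (s % 4 == 0) && (d % 4 == 0) && (bytes % 4 == 0);
}

// Hypothetical helper: checks the precondition in debug builds, then copies.
inline void CopyDwords(void *dst, const void *src, std::size_t bytes)
{
  assert(IsDwordCopyable(src, dst, bytes));
  std::memcpy(dst, src, bytes);
}
```

If the check fails for your data, a byte-oriented copy (or memcpy, which handles any alignment) is the safer choice.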

Edit: have you measured that the copying is the bottleneck?

Use memcpy. This problem has already been solved.

FYI, rep movsd is not always the best choice: rep movsb can be faster in some circumstances, and with SSE and the like the best option is often a non-temporal store such as movntdq [edi], xmm0 (movntq is the MMX form and takes an mm register, not xmm0). For large amounts of memory you can optimize further by exploiting page locality: move the data into a buffer first, then move it from there to your destination.
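For reference, a non-temporal copy can be written with SSE2 intrinsics instead of assembly. This is a minimal sketch under stated assumptions: StreamCopy and HAVE_SSE2_STREAM are made-up names, the destination must be 16-byte aligned, and on targets without SSE2 the sketch just falls back to memcpy. _mm_stream_si128 is the intrinsic that maps to movntdq.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1  // hypothetical macro, local to this sketch
#endif

// Hypothetical helper: copy with non-temporal (cache-bypassing) stores.
// Precondition: dst is 16-byte aligned. src may be unaligned.
inline void StreamCopy(void *dst, const void *src, std::size_t bytes)
{
  assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#ifdef HAVE_SSE2_STREAM
  __m128i *d = static_cast<__m128i *>(dst);
  const __m128i *s = static_cast<const __m128i *>(src);
  const std::size_t blocks = bytes / 16;
  for (std::size_t i = 0; i < blocks; ++i)
  {
    // Unaligned load, then a streaming store (movntdq).
    _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
  }
  _mm_sfence();  // make the streaming stores visible before continuing
  // Copy the remaining 0..15 bytes the ordinary way.
  std::memcpy(static_cast<char *>(dst) + blocks * 16,
              static_cast<const char *>(src) + blocks * 16, bytes % 16);
#else
  std::memcpy(dst, src, bytes);  // assumption: no SSE2 available
#endif
}
```

Streaming stores tend to pay off only for blocks much larger than the cache; for short copies like the ones in the question they are likely to be slower than a plain memcpy, so measure before adopting this.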
