I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture.
There are far more efficient ways to move data. These days, the compiler generates architecture-specific code for memcpy, optimized for the memory alignment of the data and other factors. This allows better use of non-temporal (cache-bypassing) instructions and of XMM and other registers in the x86 world.
Hard-coding rep movsb prevents this use of intrinsics.
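To make "hard-coding" concrete, here is a minimal sketch of what such a copy might look like. The helper name copy_rep_movsb is hypothetical, and the sketch assumes GCC or Clang extended inline assembly on x86-64 with the direction flag clear (as the usual calling conventions require); on other targets it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n bytes with a hard-coded rep movsb.
 * Assumes GCC/Clang extended asm on x86-64; elsewhere it calls memcpy. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    /* RDI = destination, RSI = source, RCX = byte count.
     * All three registers are modified by the instruction. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}
```

Whatever the size or alignment of the buffers, this version always executes the same instruction, so the compiler has no room to pick a different strategy.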
Therefore, for something like a memcpy, unless you are writing something that will be tied to a very specific piece of hardware and unless you are going to take the time to write a highly optimized memcpy function in assembly (or using C-level intrinsics), you are far better off allowing the compiler to figure it out for you.
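As a rough illustration of letting the compiler figure it out, the sketch below just calls the standard memcpy. The struct record type and copy_record function are hypothetical names used only for this example. Because the size is a compile-time constant, an optimizing compiler such as GCC or Clang is free to replace the call with a handful of register moves, though what it actually emits depends on the target and optimization flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical small record, used only for illustration. */
struct record {
    uint64_t id;
    double   values[3];
};

/* Plain memcpy with a constant size: the compiler may expand this
 * inline instead of calling the library routine. */
static void copy_record(struct record *out, const struct record *in)
{
    assert(out != NULL && in != NULL);
    memcpy(out, in, sizeof *out);
}
```

Comparing the generated assembly (for example with -O2 -S) against a hand-written version is an easy way to check what the compiler chose on a given machine.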