I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture.
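Since support depends on the microarchitecture, it is probably worth checking for it at run time. A minimal sketch, assuming GCC or Clang (for the cpuid.h helper header); has_ermsb is a name made up for this example. The feature flag itself is CPUID leaf 7, subleaf 0, EBX bit 9.

```c
#include <stdbool.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header, assumed available */
#endif

/* Hypothetical helper: returns true when CPUID reports ERMSB
   (leaf 7, subleaf 0, EBX bit 9). */
static bool has_ermsb(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 when leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> 9) & 1u) != 0;
#else
    return false;    /* not an x86 target, so no REP MOVSB at all */
#endif
}
```

The result would normally be computed once and cached, since it cannot change while the program runs.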
As a general memcpy() guide:
a) If the data being copied is tiny (less than maybe 20 bytes) and has a fixed size, let the compiler do it. Reason: The compiler can use normal mov instructions and avoid the startup overheads.
b) If the data being copied is small (less than about 4 KiB) and is guaranteed to be aligned, use rep movsb (if ERMSB is supported) or rep movsd (if ERMSB is not supported). Reason: Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything.
c) If the data being copied is small (less than about 4 KiB) and is not guaranteed to be aligned, use rep movsb. Reason: Using SSE or AVX, or using rep movsd for the bulk of it plus some rep movsb at the start or end, has too much overhead.
d) For all other cases use something like this:
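    mov edx,0           ; edx = pass counter; esi = source, edi = destination, ecx = byte count (must be non-zero)
.again:
    pushad              ; save all registers so every pass copies the same block again
.nextByte:
    pushad              ; each pushad/popad pair below is pure overhead
    popad
    mov al,[esi]        ; load one byte from the source
    pushad
    popad
    mov [edi],al        ; store it to the destination
    pushad
    popad
    inc esi             ; advance the source pointer
    pushad
    popad
    inc edi             ; advance the destination pointer
    pushad
    popad
    loop .nextByte      ; decrement ecx, repeat until it reaches zero
    popad               ; restore esi, edi and ecx for the next pass
    inc edx
    cmp edx,1000        ; redo the whole copy 1000 times
    jb .again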
Reason: This will be so slow that it will force programmers to find an alternative that doesn't involve copying huge globs of data; and the resulting software will be significantly faster because copying large globs of data was avoided.
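For cases b) and c), a rough sketch of what the rep movsb path could look like from C, assuming GCC or Clang extended inline asm. The names copy_rep_movsb, guide_memcpy and SMALL_COPY_LIMIT are made up for this example, the 4096 cut-off is just the "about 4 KiB" figure from above, and handing larger copies to the library memcpy is my substitution here rather than part of the guide.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off taken from the "about 4 KiB" figure; tune by measuring. */
#define SMALL_COPY_LIMIT 4096

/* Copy n bytes with REP MOVSB on x86; plain byte loop on other targets. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__i386__) || defined(__x86_64__)
    /* rep movsb copies (E/R)CX bytes from [(E/R)SI] to [(E/R)DI] and
       updates all three registers, hence the "+" (read-write) constraints.
       The direction flag is clear on function entry per the ABI. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}

/* Size-based dispatch: small copies use rep movsb, the rest go to the
   library memcpy (an assumption of this sketch, not part of the guide). */
static void *guide_memcpy(void *dst, const void *src, size_t n)
{
    if (n < SMALL_COPY_LIMIT)
        return copy_rep_movsb(dst, src, n);
    return memcpy(dst, src, n);
}
```

Like memcpy(), this assumes the two buffers do not overlap. Case a) needs no code at all: with a small compile-time-constant size, a plain memcpy() call or struct assignment lets the compiler emit the mov instructions itself.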