I've been writing in x86 assembly lately (for fun) and was wondering whether or not rep-prefixed string instructions actually have a performance edge on modern processors o
Since no one has given you any numbers, here are some I got from benchmarking my garbage collector, which is very memcpy-heavy. Of the objects to be copied, 60% are 16 bytes in length and the remainder (30%) are 500 - 8000 bytes or so.
dst, src and n are multiples of 8. Here are my three memcpy variants:
Hand-coded while-loop:
    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        ptr *end = dst + n_ptrs;
        while (dst < end) {
            *dst++ = *src++;
        }
    }
(ptr is an alias for uintptr_t.) Time: 101.16% (relative to the fastest variant)
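rep movsb:
    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        /* rep movsb modifies rdi, rsi and rcx, so all three must be
           read-write ("+") operands; the copy keeps n itself intact. */
        size_t count = n;
        asm volatile("cld\n\t"
                     "rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (count)
                     :
                     : "memory");
    }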
Time: 103.22%
rep movsq:
    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        /* rep movsq modifies rdi, rsi and rcx, so all three must be
           read-write ("+") operands; n_ptrs is zero afterwards. */
        asm volatile("cld\n\t"
                     "rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (n_ptrs)
                     :
                     : "memory");
    }
Time: 100.00%
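The fragments above assume dst, src and n are already in scope. For anyone who wants to try the rep movsq variant in isolation, here is a minimal self-contained sketch. The function name copy_words, the portable fallback loop and the missing cld are assumptions of this sketch, not part of the benchmarked code; cld is left out because both x86-64 calling conventions guarantee the direction flag is clear on function entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ptr;

/* Copy n_ptrs machine words from src to dst; the regions must not overlap.
   copy_words is a hypothetical name chosen for this sketch. */
static void copy_words(ptr *dst, const ptr *src, size_t n_ptrs)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && defined(__GNUC__) && UINTPTR_MAX == UINT64_MAX
    /* rep movsq reads and writes rdi, rsi and rcx, hence the "+" operands.
       The "memory" clobber tells the compiler that memory was changed. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(n_ptrs)
                     :
                     : "memory");
#else
    /* Fallback for other targets and compilers (an assumption of this sketch). */
    while (n_ptrs--) {
        *dst++ = *src++;
    }
#endif
}
```

Since the pointers and the count are passed by value, the caller's variables are left untouched, unlike in the inline fragments where dst and src advance past the copied data.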
rep movsq wins by a tiny margin.
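If you want to produce comparable numbers on your own machine, a rough timing harness could look like the sketch below. It is not the harness behind the figures above: the names copy_fn, time_copy and relative_percent, the fixed 1024-word buffers and the use of clock() are all assumptions made for illustration. Each variant's time is divided by the fastest time to get a percentage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uintptr_t ptr;

/* Signature shared by every variant under test (hypothetical name). */
typedef void (*copy_fn)(ptr *dst, const ptr *src, size_t n);

/* Stand-in variant: the plain word-by-word loop; n is in bytes. */
static void copy_loop(ptr *dst, const ptr *src, size_t n)
{
    size_t n_ptrs = n / sizeof(ptr);
    while (n_ptrs--) {
        *dst++ = *src++;
    }
}

#define BUF_WORDS 1024

/* Run `rounds` copies of n bytes each and return the CPU seconds used. */
static double time_copy(copy_fn fn, size_t n, size_t rounds)
{
    static ptr src[BUF_WORDS], dst[BUF_WORDS];
    volatile ptr sink = 0; /* discourages the compiler from dropping the copies */

    assert(fn != NULL);
    if (n > sizeof src) {
        n = sizeof src;    /* clamp to the buffer size */
    }
    clock_t start = clock();
    for (size_t i = 0; i < rounds; i++) {
        src[0] = (ptr)i;   /* vary the input a little each round */
        fn(dst, src, n);
        sink += dst[0];
    }
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Express a timing as a percentage of the fastest timing. */
static double relative_percent(double t, double fastest)
{
    return 100.0 * t / fastest;
}
```

Differences of a percent or two, like the ones above, are easily swamped by noise, so run many rounds and repeat the measurement several times before drawing conclusions.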