Performance of x86 rep instructions on modern (pipelined/superscalar) processors


耶瑟儿～ 2020-12-13 04:46

I've been writing in x86 assembly lately (for fun) and was wondering whether or not rep-prefixed string instructions actually have a performance edge on modern processors o

3 answers

不知归路 (OP)

2020-12-13 05:29

Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. About 60% of the objects I copy are 16 bytes long, and another 30% are roughly 500 to 8000 bytes.

Precondition: dst, src and n are all multiples of 8.
Processor: AMD Phenom(tm) II X6 1090T, 64-bit, Linux

Here are my three memcpy variants:

Hand-coded while-loop:

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    ptr *end = dst + n_ptrs;
    while (dst < end) {
        *dst++ = *src++;
    }
}

(ptr is an alias for uintptr_t.) Time: 101.16%
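For readers who want to compile the loop variant on its own, here is a minimal sketch that wraps it in a function. The function name copy_loop and its signature are assumptions made for illustration; the answer only shows the body. The fast path is written as 2 * sizeof(ptr), which equals the 16 bytes from the answer on a 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ptr; /* same alias as in the answer */

/* Hypothetical wrapper: copy n bytes as pointer-sized words.
 * n must be a multiple of sizeof(ptr), matching the stated precondition. */
static void copy_loop(ptr *dst, const ptr *src, size_t n) {
    assert(n % sizeof(ptr) == 0);
    if (n == 2 * sizeof(ptr)) {
        /* Fast path for the most common object size (16 bytes on 64-bit). */
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        ptr *end = dst + n_ptrs;
        while (dst < end) {
            *dst++ = *src++;
        }
    }
}
```

Calling copy_loop(dst, src, 16) on a 64-bit build takes the two-word fast path; any other multiple of the word size goes through the loop.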

rep movsb:

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    /* rep movsb modifies rdi, rsi and rcx, so all three are read-write operands */
    asm volatile("cld\n\t"
                 "rep movsb"
                 : "+D" (dst), "+S" (src), "+c" (n)
                 :
                 : "memory");
}

Time: 103.22%

rep movsq:

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    /* rep movsq modifies rdi, rsi and rcx, so all three are read-write operands */
    asm volatile("cld\n\t"
                 "rep movsq"
                 : "+D" (dst), "+S" (src), "+c" (n_ptrs)
                 :
                 : "memory");
}

Time: 100.00%
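As a self-contained illustration of the rep movsq variant, the sketch below wraps the inline assembly in a function. The name copy_rep_movsq, the signature, and the portable fallback branch are assumptions, not part of the answer, and the 16-byte fast path is left out to keep the focus on the asm. It omits cld on the assumption that the System V x86-64 ABI applies, which requires the direction flag to be clear on function entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ptr; /* same alias as in the answer */

/* Hypothetical wrapper: copy n bytes, n a multiple of 8, with rep movsq. */
static void copy_rep_movsq(ptr *dst, const ptr *src, size_t n) {
    assert(n % 8 == 0);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    size_t n_qwords = n / 8; /* rep movsq moves 8 bytes per iteration */
    /* rdi, rsi and rcx are all updated by the instruction, hence "+". */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(n_qwords)
                     :
                     : "memory");
#else
    /* Fallback for other targets so the sketch still compiles and runs. */
    for (size_t i = 0; i < n / sizeof(ptr); i++) {
        dst[i] = src[i];
    }
#endif
}
```

Declaring the three registers as read-write operands tells the compiler that their values change, so it does not reuse a stale count or pointer after the asm statement.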

rep movsq wins, but only by a tiny margin: about 1% over the hand-coded loop and about 3% over rep movsb.
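The percentages read as each variant's time relative to the fastest one. A rough sketch of how such figures could be produced follows; it assumes that normalization, and the names copy_fn, copy_memcpy, time_copies and relative_percent are all hypothetical. The answer's numbers came from a full garbage collector run, not from a micro-benchmark like this, and an optimizing compiler may shorten or remove the timed loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef uintptr_t ptr;
typedef void (*copy_fn)(ptr *dst, const ptr *src, size_t n);

/* Stand-in variant so the sketch is self-contained: plain memcpy. */
static void copy_memcpy(ptr *dst, const ptr *src, size_t n) {
    memcpy(dst, src, n);
}

/* CPU time in seconds spent on `reps` calls of `fn`, each copying n bytes. */
static double time_copies(copy_fn fn, ptr *dst, const ptr *src,
                          size_t n, long reps) {
    assert(reps > 0);
    clock_t start = clock();
    for (long i = 0; i < reps; i++) {
        fn(dst, src, n);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Express a time as a percentage of the fastest measured time. */
static double relative_percent(double t, double fastest) {
    return 100.0 * t / fastest;
}
```

With this normalization the fastest variant always reports 100%, and a variant that takes 1.5 times as long reports 150%.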
