Performance of x86 rep instructions on modern (pipelined/superscalar) processors

耶瑟儿~ 2020-12-13 04:46

I've been writing in x86 assembly lately (for fun) and was wondering whether or not rep-prefixed string instructions actually have a performance edge on modern processors o

3 answers
  •  不知归路
    2020-12-13 05:29

    Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. About 60% of the objects it copies are 16 bytes long, and the remaining 40% are roughly 500 to 8000 bytes.

    • Precondition: dst, src and n are all multiples of 8.
    • Processor: AMD Phenom(tm) II X6 1090T, 64-bit Linux
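The precondition above could be checked at run time with something like the following sketch. The helper name `copy_precondition_ok` is hypothetical and is not part of the benchmarked code; it only illustrates what "multiples of 8" means for the two pointers and the byte count.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when dst, src and n are all
   multiples of 8, as the benchmark assumes. */
static bool copy_precondition_ok(const void *dst, const void *src, size_t n)
{
    /* OR the three values together: if any of the low three bits is set,
       at least one of the values is not a multiple of 8. */
    return (((uintptr_t)dst | (uintptr_t)src | (uintptr_t)n) & 7u) == 0;
}
```

A debug build might call this once before each copy and skip it in release builds, since the check itself costs a few instructions per object.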

    Here are my three memcpy variants:

    Hand-coded while-loop:

    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        ptr *end = dst + n_ptrs;
        while (dst < end) {
            *dst++ = *src++;
        }
    }
    

    (ptr is an alias for uintptr_t). Time: 101.16%

    rep movsb

    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        /* rep movsb updates rdi, rsi and rcx, so all three are read-write ("+") operands. */
        asm volatile("cld\n\t"
                     "rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (n)
                     :
                     : "memory");
    }
    

    Time: 103.22%

    rep movsq

    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        /* rep movsq updates rdi, rsi and rcx, so all three are read-write ("+") operands. */
        asm volatile("cld\n\t"
                     "rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (n_ptrs)
                     :
                     : "memory");
    }
    

    Time: 100.00%

    rep movsq wins by a tiny margin.
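For anyone who wants to repeat the comparison, here is a minimal self-contained sketch of the loop and rep movsq variants as plain functions. The names `copy_loop`, `copy_rep_movsq` and `copy_object` are made up for illustration, and the fallback to the plain loop on non-x86-64 or non-GCC/Clang builds is my assumption, not something from the benchmark. It also assumes the same precondition as above (n is a byte count and a multiple of 8) and a 64-bit target for the 16-byte fast path.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t ptr;

/* Plain word-by-word loop. n is a byte count, assumed to be a
   multiple of sizeof(ptr). */
static void copy_loop(ptr *dst, const ptr *src, size_t n)
{
    ptr *end = dst + n / sizeof(ptr);
    while (dst < end) {
        *dst++ = *src++;
    }
}

/* rep movsq variant. The instruction advances rdi and rsi and counts
   rcx down to zero, so all three are declared as read-write operands. */
static void copy_rep_movsq(ptr *dst, const ptr *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    size_t n_ptrs = n / sizeof(ptr);
    __asm__ volatile("cld\n\t"
                     "rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (n_ptrs)
                     :
                     : "memory", "cc");
#else
    /* Assumption: on other targets, fall back to the plain loop. */
    copy_loop(dst, src, n);
#endif
}

/* Hypothetical dispatcher with the same shape as the snippets above:
   a two-word fast path for 16-byte objects (8-byte ptr assumed),
   rep movsq for everything else. */
static void copy_object(ptr *dst, const ptr *src, size_t n)
{
    if (n == 16) {
        dst[0] = src[0];
        dst[1] = src[1];
    } else {
        copy_rep_movsq(dst, src, n);
    }
}
```

Timing each function over the same mix of object sizes and dividing by the fastest result would give percentages in the same form as the ones quoted above, though the numbers will differ on other processors.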
