Fastest de-interleave operation in C?

一个人的身影 2021-01-02 00:30

I have a pointer to an array of bytes mixed that contains the interleaved bytes of two distinct arrays array1 and array2. Say mi

6 answers
  •  心在旅途
    2021-01-02 01:05

    1. premature optimisation is bad

    2. your compiler is probably better at optimising than you are.

    That said, there are things you can do to help out the compiler because you have semantic knowledge of your data that a compiler cannot have:

    1. read and write as many bytes as you can, up to the native word size - memory operations are expensive, so do manipulations in registers where possible

    2. unroll loops - look into "Duff's Device".
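As a rough illustration of the first point, here is a minimal sketch that loads eight interleaved bytes at a time and separates them in a register before storing. The names `deinterleave_words` and `pack_even_bytes` are hypothetical, and the code assumes a little-endian machine; it is not taken from the question or from the versions measured below.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Gather bytes 0, 2, 4, 6 of w (counting from the least significant byte)
   into the low 32 bits of the result. */
static uint64_t pack_even_bytes(uint64_t w)
{
    w &= UINT64_C(0x00FF00FF00FF00FF);
    w = (w | (w >> 8)) & UINT64_C(0x0000FFFF0000FFFF);
    w = (w | (w >> 16)) & UINT64_C(0x00000000FFFFFFFF);
    return w;
}

/* Hypothetical helper: even-indexed bytes of mixed go to a, odd-indexed
   bytes go to b. Assumes a little-endian CPU; n is the length of mixed in
   bytes, and a trailing unpaired byte (odd n) is ignored. */
void deinterleave_words(const uint8_t *mixed, uint8_t *a, uint8_t *b, size_t n)
{
    size_t i = 0, o = 0;

    /* Main loop: one 8-byte load, two 4-byte stores. */
    for (; i + 8 <= n; i += 8, o += 4) {
        uint64_t w;
        memcpy(&w, mixed + i, sizeof w);           /* alignment-safe load */
        uint32_t even = (uint32_t)pack_even_bytes(w);
        uint32_t odd  = (uint32_t)pack_even_bytes(w >> 8);
        memcpy(a + o, &even, sizeof even);
        memcpy(b + o, &odd, sizeof odd);
    }

    /* Tail: remaining pairs, one byte at a time. */
    for (; i + 2 <= n; i += 2, o++) {
        a[o] = mixed[i];
        b[o] = mixed[i + 1];
    }
}
```

Using `memcpy` for the loads and stores avoids alignment and aliasing problems; optimising compilers typically turn these fixed-size copies into single move instructions. Whether this beats the plain byte loop depends on the compiler and CPU, so it is worth measuring.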

    FWIW, I produced two versions of your copy loop, one much the same as yours, the second using what most would consider "optimal" (albeit still simple) C code:

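    typedef unsigned char byte;   /* assumed; the original does not show it */

    /* Indexed version: p holds n interleaved bytes, n even. */
    void test1(byte *p, byte *p1, byte *p2, int n)
    {
        int i, j;
        for (i = 0, j = 0; i < n / 2; i++, j += 2) {
            p1[i] = p[j];        /* even-indexed bytes */
            p2[i] = p[j + 1];    /* odd-indexed bytes */
        }
    }
    
    /* Pointer version: n must be even, otherwise the count
       steps past zero and the loop does not terminate. */
    void test2(byte *p, byte *p1, byte *p2, int n)
    {
        while (n) {
            *p1++ = *p++;
            *p2++ = *p++;
            n--; n--;
        }
    }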
    

    With gcc -O3 -S on Intel x86 they both produced almost identical assembly code. Here are the inner loops:

    LBB1_2:
        movb    -1(%rdi), %al
        movb    %al, (%rsi)
        movb    (%rdi), %al
        movb    %al, (%rdx)
        incq    %rsi
        addq    $2, %rdi
        incq    %rdx
        decq    %rcx
        jne LBB1_2
    

    and

    LBB2_2:
        movb    -1(%rdi), %al
        movb    %al, (%rsi)
        movb    (%rdi), %al
        movb    %al, (%rdx)
        incq    %rsi
        addq    $2, %rdi
        incq    %rdx
        addl    $-2, %ecx
        jne LBB2_2
    

    Both have the same number of instructions. The only difference is the loop counter: the first version's source counts up to n / 2, while the second counts n down to zero in steps of two, which is why one loop ends in decq and the other in addl $-2.

    EDIT: here's a better version:

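    /* non-portable - assumes little endian, a 16-bit ushort, and that p
       is suitably aligned for ushort access */
    void test3(byte *p, byte *p1, byte *p2, int n)
    {
        ushort *ps = (ushort *)p;
    
        n /= 2;                  /* number of byte pairs */
        while (n) {
            ushort v = *ps++;    /* one 16-bit load fetches both bytes */
            *p1++ = v;           /* low byte: even-indexed input byte */
            *p2++ = v >> 8;      /* high byte: odd-indexed input byte */
            n--;
        }
    }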
    

    resulting in:

    LBB3_2:
        movzwl  (%rdi), %ecx
        movb    %cl, (%rsi)
        movb    %ch, (%rdx)  # NOREX
        addq    $2, %rdi
        incq    %rsi
        incq    %rdx
        decq    %rax
        jne LBB3_2
    

    which is one instruction shorter, because a single 16-bit load fills %ecx and the two bytes are then directly addressable as %cl and %ch.
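Instruction counts are only a proxy for speed, so it can help to check that two variants agree and then time them. The following is a minimal sketch of such a harness; every name in it (`split_bytes`, `split_u16`, `same_output`, `seconds_for`) is hypothetical, `split_bytes` mirrors test1, and `split_u16` follows the idea of test3 but loads through `memcpy` and still assumes little endian.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*deinterleave_fn)(const uint8_t *, uint8_t *, uint8_t *, size_t);

/* Byte-at-a-time reference, equivalent in effect to test1. */
void split_bytes(const uint8_t *p, uint8_t *p1, uint8_t *p2, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        p1[i] = p[2 * i];
        p2[i] = p[2 * i + 1];
    }
}

/* 16-bit-load variant in the spirit of test3. Assumes little endian. */
void split_u16(const uint8_t *p, uint8_t *p1, uint8_t *p2, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        uint16_t v;
        memcpy(&v, p + 2 * i, sizeof v);   /* avoids the pointer cast */
        p1[i] = (uint8_t)v;
        p2[i] = (uint8_t)(v >> 8);
    }
}

/* Returns 1 if both functions produce identical output for n input bytes,
   0 on mismatch or allocation failure. */
int same_output(deinterleave_fn ref, deinterleave_fn cand, size_t n)
{
    size_t half = n / 2;
    uint8_t *in = malloc(n + 1);
    uint8_t *out = calloc(4, half + 1);    /* room for four output halves */
    int ok = 0;

    if (in && out) {
        for (size_t i = 0; i < n; i++)
            in[i] = (uint8_t)(i * 31u + 7u);   /* arbitrary test pattern */
        ref(in, out, out + half, n);
        cand(in, out + 2 * half, out + 3 * half, n);
        ok = memcmp(out, out + 2 * half, 2 * half) == 0;
    }
    free(in);
    free(out);
    return ok;
}

/* CPU time in seconds for reps calls of fn on the same buffers. */
double seconds_for(deinterleave_fn fn, const uint8_t *in,
                   uint8_t *o1, uint8_t *o2, size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(in, o1, o2, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

`clock()` has coarse resolution, so the buffer and repetition count need to be large enough for the totals to be meaningful, and results will vary with compiler flags and hardware.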
