Difficulties to measure C/C++ performance

执笔经年 2021-02-19 18:07

I wrote a piece of C code to illustrate a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcomes than I expected. My goal was to w

3 answers
  •  青春惊慌失措
    2021-02-19 18:26

    With -O1, gcc-4.7.1 calls unpredictableIfs only once and reuses the result, because it recognizes that it is a pure function, so the result will be the same every time it is called. (Mine did; I verified this by looking at the generated assembly.)
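The question's code is cut off above, so the following is only a hypothetical reconstruction of unpredictableIfs, inferred from the assembly in this answer: it counts the `i` for which `i & 1073741826` is zero. The parameter `n` stands in for the hard-coded 1000000000, and `callTwice` is a made-up name illustrating the pattern the reuse applies to: two calls with the same argument to a function without side effects, which gcc -O1 may evaluate only once.

```c
/* Mask taken from the assembly listing; the original source is not shown. */
#define MASK 1073741826u

/* Hypothetical reconstruction: counts i in [0, n) with (i & MASK) == 0,
   using an explicit branch. */
unsigned unpredictableIfs(unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i) {
        if ((i & MASK) == 0)
            ++count;
    }
    return count;
}

/* Two calls with identical arguments. The function has no side effects,
   so an optimizing compiler may compute the result once and reuse it. */
unsigned callTwice(unsigned n)
{
    unsigned a = unpredictableIfs(n);
    unsigned b = unpredictableIfs(n);
    return a + b;
}
```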

    With higher optimisation levels, the functions are inlined, and the compiler no longer recognizes that it is the same code, so it is run each time a function call appears in the source.
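One way to make every optimisation level execute the same code the same number of times is sketched below, assuming gcc or clang. The GCC-specific `__attribute__((noinline))` keeps each call site running one shared out-of-line copy. That alone does not stop the -O1 reuse (there the function was still called, just once), so the loop bound is also read through a `volatile` variable; each read is a fresh load, and the compiler can no longer assume both calls get the same argument. `benchLimit` and `runBoth` are hypothetical names, and the function body is the same reconstruction as before.

```c
#define MASK 1073741826u

/* GCC/clang attribute: keep the benchmarked function out of line so every
   call executes the same machine code instead of a separately inlined copy. */
__attribute__((noinline))
unsigned unpredictableIfs(unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i)
        if ((i & MASK) == 0)
            ++count;
    return count;
}

/* Hypothetical loop bound. Because it is volatile, each read is a separate
   load, so the two calls below cannot be merged into one. */
volatile unsigned benchLimit = 1000000000u;

unsigned runBoth(void)
{
    unsigned a = unpredictableIfs(benchLimit);
    unsigned b = unpredictableIfs(benchLimit);
    return a + b;
}
```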

    Apart from that, my gcc-4.7.1 handles unpredictableIfs best with -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is handled much better with -O3. The timings of different runs of the same code are, however, consistent here: equal or differing by 10 milliseconds (the granularity of clock()), so I have no idea what could cause the substantially different times you reported for unpredictableIfs at -O3.
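If the 10 ms granularity of `clock()` gets in the way, the POSIX function `clock_gettime` with `CLOCK_MONOTONIC` typically offers much finer resolution. A minimal sketch, assuming a POSIX system; `elapsedMs` and `timeMs` are hypothetical helpers, and the timed function's result is written out so that it is actually used.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Milliseconds between two monotonic-clock samples. */
static double elapsedMs(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1e3
         + (end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Times one call of fn(n) and stores its result in *result. */
double timeMs(unsigned (*fn)(unsigned), unsigned n, unsigned *result)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *result = fn(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsedMs(t0, t1);
}
```

Usage would be along the lines of `double ms = timeMs(unpredictableIfs, 1000000000u, &r);`, repeated a few times to see the run-to-run spread.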

    With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

    .L12:
        movl    %eax, %ecx
        andl    $1073741826, %ecx
        cmpl    $1, %ecx
        adcl    $0, %edx
        addl    $1, %eax
        cmpl    $1000000000, %eax
        jne .L12
    

    and for noIfs it's similar:

    .L15:
        xorl    %ecx, %ecx
        testl   $1073741826, %eax
        sete    %cl
        addl    $1, %eax
        addl    %ecx, %edx
        cmpl    $1000000000, %eax
        jne .L15
    

    where it was

    .L7:
        testl   $1073741826, %edx
        sete    %cl
        movzbl  %cl, %ecx
        addl    %ecx, %eax
        addl    $1, %edx
        cmpl    $1000000000, %edx
        jne .L7
    

    with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.
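Both -O2 loops are branch-free. In the unpredictableIfs loop, `cmpl $1, %ecx` sets the carry flag exactly when `ecx < 1` as an unsigned value, i.e. when `ecx == 0`, and `adcl $0, %edx` adds that carry to the count. In the noIfs loop, `sete` produces the 0/1 directly, and the preceding `xorl %ecx, %ecx` zeroes the whole register, so the `movzbl` from the -O1 version is not needed. The C sketch below spells out both idioms; the names `addIfZero`, `countWithCarry` and `noIfs`, and the parameter `n` in place of 1000000000, are assumptions, since the question's code is not shown.

```c
#define MASK 1073741826u  /* mask from the assembly listing */

/* cmp $1 / adc $0 idiom: for unsigned x, x < 1 is the borrow (carry flag)
   of x - 1, which is set exactly when x == 0. */
static unsigned addIfZero(unsigned count, unsigned x)
{
    return count + (x < 1u);
}

/* Hypothetical unpredictableIfs loop, written the way -O2 compiled it. */
unsigned countWithCarry(unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i)
        count = addIfZero(count, i & MASK);
    return count;
}

/* Hypothetical noIfs: the comparison yields 0 or 1 (sete), added directly. */
unsigned noIfs(unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i)
        count += (i & MASK) == 0;
    return count;
}
```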

    With -O3, the loop for unpredictableIfs becomes worse,

    .L14:
        leal    1(%rdx), %ecx
        testl   $1073741826, %eax
        cmove   %ecx, %edx
        addl    $1, %eax
        cmpl    $1000000000, %eax
        jne     .L14
    

    and for noIfs (including the setup code here), it becomes better:

        pxor    %xmm2, %xmm2
        movq    %rax, 32(%rsp)
        movdqa  .LC3(%rip), %xmm6
        xorl    %eax, %eax
        movdqa  .LC2(%rip), %xmm1
        movdqa  %xmm2, %xmm3
        movdqa  .LC4(%rip), %xmm5
        movdqa  .LC5(%rip), %xmm4
        .p2align 4,,10
        .p2align 3
    .L18:
        movdqa  %xmm1, %xmm0
        addl    $1, %eax
        paddd   %xmm6, %xmm1
        cmpl    $250000000, %eax
        pand    %xmm5, %xmm0
        pcmpeqd %xmm3, %xmm0
        pand    %xmm4, %xmm0
        paddd   %xmm0, %xmm2
        jne .L18
    
    .LC2:
        .long   0
        .long   1
        .long   2
        .long   3
        .align 16
    .LC3:
        .long   4
        .long   4
        .long   4
        .long   4
        .align 16
    .LC4:
        .long   1073741826
        .long   1073741826
        .long   1073741826
        .long   1073741826
        .align 16
    .LC5:
        .long   1
        .long   1
        .long   1
        .long   1
    

    It computes four iterations at once, and accordingly, noIfs runs almost four times as fast.
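The vectorized loop can also be written by hand with SSE2 intrinsics. This sketch mirrors the -O3 code: indices `{i, i+1, i+2, i+3}` (`.LC2`) advanced by 4 per step (`.LC3`), masked with 1073741826 (`.LC4`), compared with zero (`pcmpeqd`, giving all-ones lanes on a match), reduced to 1 per matching lane with a vector of ones (`.LC5`), and accumulated with `paddd`. It assumes `n` is a multiple of 4, as 1000000000 is, and falls back to scalar code when SSE2 is unavailable; `noIfsSse2` is a hypothetical name.

```c
#include <assert.h>

#define MASK 1073741826u

#if defined(__SSE2__)
#include <emmintrin.h>

unsigned noIfsSse2(unsigned n)
{
    assert(n % 4 == 0);  /* this sketch handles whole groups of four only */
    __m128i idx = _mm_set_epi32(3, 2, 1, 0);         /* .LC2: lanes 0..3 */
    const __m128i step = _mm_set1_epi32(4);          /* .LC3 */
    const __m128i mask = _mm_set1_epi32((int)MASK);  /* .LC4 */
    const __m128i ones = _mm_set1_epi32(1);          /* .LC5 */
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();

    for (unsigned k = 0; k < n / 4; ++k) {
        __m128i m = _mm_and_si128(idx, mask);              /* pand */
        m = _mm_cmpeq_epi32(m, zero);                      /* pcmpeqd */
        acc = _mm_add_epi32(acc, _mm_and_si128(m, ones));  /* pand, paddd */
        idx = _mm_add_epi32(idx, step);                    /* paddd */
    }

    /* Horizontal sum of the four per-lane counts. */
    unsigned lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
#else
/* Scalar fallback producing the same result. */
unsigned noIfsSse2(unsigned n)
{
    assert(n % 4 == 0);
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i)
        count += (i & MASK) == 0;
    return count;
}
#endif
```

The loop body runs n / 4 times, which matches the `cmpl $250000000, %eax` bound in the compiler's version.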

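In the -O3 loop for unpredictableIfs, `leal 1(%rdx), %ecx` computes `count + 1` unconditionally and `cmove` keeps it only when the masked value is zero. The sketch below (hypothetical name `selectCount`) writes that select out in C; whether a particular gcc version emits `cmov` for it depends on version and flags. In this form each iteration's `count` passes through both the `lea` and the `cmove`, a longer loop-carried dependency than the single `adc` at -O2, which is one plausible reason that loop got slower.

```c
#define MASK 1073741826u

/* Select form of the counting loop: compute count + 1 unconditionally
   (like lea), then keep it only if the masked index is zero (like cmove). */
unsigned selectCount(unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i) {
        unsigned incremented = count + 1;
        count = ((i & MASK) == 0) ? incremented : count;
    }
    return count;
}
```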