What is the effect of ordering if…else if statements by probability?

花落未央 2020-12-07 13:39

Specifically, if I have a series of if...else if statements, and I somehow know beforehand the relative probability that each statement will evalua

10 answers
  •  攒了一身酷
    2020-12-07 14:02

    I decided to rerun the test on my own machine using Lik32's code. I had to change it because my Windows setup or compiler treats the high-resolution clock as having 1 ms resolution. I compiled using

    mingw32-g++.exe -O3 -Wall -std=c++11 -fexceptions -g

    vector<int> rand_vec(10000000);
    

    GCC has made the same transformation on both original versions of the code.

    Note that only the first two conditions are tested, because the third must always be true. GCC is a bit of a Sherlock here.
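
    To see why the third test can be dropped: once `i >= 95` and `i < 20` have both failed, `20 <= i < 95` is all that is left. The sketch below is a source-level illustration only, not a description of GCC's internals; the names `classify_chain`, `classify_folded` and `chains_agree` are made up for this example.

```cpp
#include <cassert>

// Hypothetical helpers for illustration. Return 0 = low, 1 = mid, 2 = high.

// The chain as written, with an explicit third test.
int classify_chain(int i) {
    if (i >= 95) return 2;
    else if (i < 20) return 0;
    else if (i >= 20 && i < 95) return 1;
    return -1;  // never reached for any int value
}

// The same logic with the redundant third test removed.
int classify_folded(int i) {
    if (i >= 95) return 2;
    if (i < 20) return 0;
    return 1;  // only 20..94 can get here
}

// Returns true if both versions agree for every value in [lo, hi].
bool chains_agree(int lo, int hi) {
    for (int i = lo; i <= hi; ++i) {
        if (classify_chain(i) != classify_folded(i)) return false;
    }
    return true;
}
```

    Calling `chains_agree(-1000, 1000)` is enough to convince yourself that the two forms are interchangeable, which is why only two compare-and-branch pairs show up in the listings.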

    Reverse

    .L233:
            mov     DWORD PTR [rsp+104], 0
            mov     DWORD PTR [rsp+100], 0
            mov     DWORD PTR [rsp+96], 0
            call    std::chrono::_V2::system_clock::now()
            mov     rbp, rax
            mov     rax, QWORD PTR [rsp+8]
            jmp     .L219
    .L293:
            mov     edx, DWORD PTR [rsp+104]
            add     edx, 1
            mov     DWORD PTR [rsp+104], edx
    .L217:
            add     rax, 4
            cmp     r14, rax
            je      .L292
    .L219:
            mov     edx, DWORD PTR [rax]
            cmp     edx, 94
            jg      .L293 // >= 95
            cmp     edx, 19
            jg      .L218 // >= 20
            mov     edx, DWORD PTR [rsp+96]
            add     rax, 4
            add     edx, 1 // < 20 Sherlock
            mov     DWORD PTR [rsp+96], edx
            cmp     r14, rax
            jne     .L219
    .L292:
            call    std::chrono::_V2::system_clock::now()
    
    .L218: // further down
            mov     edx, DWORD PTR [rsp+100]
            add     edx, 1
            mov     DWORD PTR [rsp+100], edx
            jmp     .L217
    
    And sorted
    
            mov     DWORD PTR [rsp+104], 0
            mov     DWORD PTR [rsp+100], 0
            mov     DWORD PTR [rsp+96], 0
            call    std::chrono::_V2::system_clock::now()
            mov     rbp, rax
            mov     rax, QWORD PTR [rsp+8]
            jmp     .L226
    .L296:
            mov     edx, DWORD PTR [rsp+100]
            add     edx, 1
            mov     DWORD PTR [rsp+100], edx
    .L224:
            add     rax, 4
            cmp     r14, rax
            je      .L295
    .L226:
            mov     edx, DWORD PTR [rax]
            lea     ecx, [rdx-20]
            cmp     ecx, 74
            jbe     .L296
            cmp     edx, 19
            jle     .L297
            mov     edx, DWORD PTR [rsp+104]
            add     rax, 4
            add     edx, 1
            mov     DWORD PTR [rsp+104], edx
            cmp     r14, rax
            jne     .L226
    .L295:
            call    std::chrono::_V2::system_clock::now()
    
    .L297: // further down
            mov     edx, DWORD PTR [rsp+96]
            add     edx, 1
            mov     DWORD PTR [rsp+96], edx
            jmp     .L224
    

    So this doesn't tell us much, except that the last case doesn't need a branch prediction.

    Next I tried all 6 orderings of the ifs; the top 2 are the original reverse and sorted. high is >= 95, low is < 20, and mid is 20-94, with 10000000 iterations each.

    high, low, mid: 43000000ns
    mid, low, high: 46000000ns
    high, mid, low: 45000000ns
    low, mid, high: 44000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 44000000ns
    mid, low, high: 47000000ns
    high, mid, low: 44000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 45000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 44000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 42000000ns
    mid, low, high: 46000000ns
    high, mid, low: 46000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 43000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 44000000ns
    low, mid, high: 44000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 48000000ns
    high, mid, low: 44000000ns
    low, mid, high: 44000000ns
    mid, high, low: 45000000ns
    low, high, mid: 45000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 45000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 45000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 46000000ns
    high, mid, low: 45000000ns
    low, mid, high: 45000000ns
    mid, high, low: 45000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 42000000ns
    mid, low, high: 46000000ns
    high, mid, low: 44000000ns
    low, mid, high: 45000000ns
    mid, high, low: 45000000ns
    low, high, mid: 44000000ns
    
    1900020, 7498968, 601012
    
    Process returned 0 (0x0)   execution time : 2.899 s
    Press any key to continue.
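
    For anyone who wants to repeat this, a harness along the following lines should do. It is a sketch under assumptions, not Lik32's original code (which is not reproduced here): the names `Counts`, `make_data`, `count_high_low_mid`, `count_mid_low_high` and `time_ns` are invented, the data is assumed to be uniform in 1..100 (which matches the roughly 19/75/6 split printed above), and the counters are `volatile` because the listings store them back to the stack on every iteration. Only two of the six orderings are shown.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical result type: how many values fell into each bucket.
struct Counts { int low, mid, high; };

// Assumption: values are uniform in 1..100.
std::vector<int> make_data(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(1, 100);
    std::vector<int> v(n);
    for (int& x : v) x = dist(gen);
    return v;
}

// "high, low, mid" ordering. The volatile counters force a real load and
// store per increment, which discourages branch-free code generation.
Counts count_high_low_mid(const std::vector<int>& v) {
    volatile int low = 0, mid = 0, high = 0;
    for (int i : v) {
        if (i >= 95) high = high + 1;
        else if (i < 20) low = low + 1;
        else mid = mid + 1;
    }
    return {low, mid, high};
}

// "mid, low, high" ordering, same buckets.
Counts count_mid_low_high(const std::vector<int>& v) {
    volatile int low = 0, mid = 0, high = 0;
    for (int i : v) {
        if (i >= 20 && i < 95) mid = mid + 1;
        else if (i < 20) low = low + 1;
        else high = high + 1;
    }
    return {low, mid, high};
}

// Runs f once over v, stores its result in out, returns elapsed nanoseconds.
template <class F>
std::int64_t time_ns(F f, const std::vector<int>& v, Counts& out) {
    auto t0 = std::chrono::steady_clock::now();
    out = f(v);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

    A typical use is `auto v = make_data(10000000, 1); Counts c; auto ns = time_ns(count_high_low_mid, v, c);`, repeated several times per ordering as in the runs above. `steady_clock` is used here because it is monotonic; its resolution is still implementation-dependent, so check it on your toolchain.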
    

    So why is the order high, low, mid (marginally) faster?

    Because the most unpredictable case is last, and therefore never runs through the branch predictor.

              if (i >= 95) ++nHigh;               // most predictable: goes the same way 94% of the time
              else if (i < 20) ++nLow;            // goes the same way (94-19)/94 ~ 80% of the time
              else if (i >= 20 && i < 95) ++nMid; // never tested: this is simply whatever remains
    

    So the branches will be predicted taken, taken and remainder, with

    6% + (0.94 × 20%) mispredicts.

    "Sorted"

              if (i >= 20 && i < 95) ++nMid;  // 75% not taken
              else if (i < 20) ++nLow;        // 19/25 76% not taken
              else if (i >= 95) ++nHigh;      //Least likely branch
    

    The branches will be predicted not taken, not taken, and the last one is deduced by GCC ("Sherlock").

    25% + (0.75 × 24%) mispredicts.

    That gives an 18-23% difference in mispredicts (against a measured difference of ~9%), but we need to compare cycles rather than mispredict percentages.
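
    The mispredict rates can also be estimated by simulation. The sketch below uses a textbook 2-bit saturating counter per `if`, which is a deliberate simplification and not how Nehalem's predictor actually works. The names `TwoBitPredictor`, `MissRates` and `simulate` are invented, and the input is assumed to be uniform in 1..100.

```cpp
#include <cassert>
#include <random>

// Toy predictor: states 0 and 1 predict "false", states 2 and 3 predict "true".
struct TwoBitPredictor {
    int state = 0;
    int misses = 0;
    int uses = 0;

    // Record one outcome and update the saturating counter.
    void observe(bool outcome) {
        bool predicted = state >= 2;
        if (predicted != outcome) ++misses;
        ++uses;
        if (outcome) { if (state < 3) ++state; }
        else         { if (state > 0) --state; }
    }

    double rate() const {
        return uses ? static_cast<double>(misses) / uses : 0.0;
    }
};

// Miss rate of each tested branch, plus total misses per loop iteration.
struct MissRates { double first; double second; double per_iteration; };

// mid_first == false: "high, low, mid" order; true: "mid, low, high" order.
// Assumption: values are uniform in 1..100, one independent counter per if.
MissRates simulate(bool mid_first, int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(1, 100);
    TwoBitPredictor p1, p2;
    for (int k = 0; k < n; ++k) {
        int i = dist(gen);
        bool first = mid_first ? (i >= 20 && i < 95) : (i >= 95);
        p1.observe(first);
        if (!first) p2.observe(i < 20);  // second test only runs if the first failed
    }
    double total = n > 0 ? static_cast<double>(p1.misses + p2.misses) / n : 0.0;
    return {p1.rate(), p2.rate(), total};
}
```

    Comparing `simulate(false, n, seed).per_iteration` with `simulate(true, n, seed).per_iteration` gives a rough feel for how much the ordering changes the miss count. Note that the second test is only reached when the first one fails, which the simulation accounts for by feeding `p2` only on those iterations.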

    Let's assume a 17-cycle mispredict penalty on my Nehalem CPU, that each check takes 1 cycle to issue (4-5 instructions), and that the loop takes one cycle too. The data dependencies are the counters and the loop variables, but once the mispredicts are out of the way they shouldn't influence the timing.

    So for "reverse", we get the timings (this should be the formula used in Computer Architecture: A Quantitative Approach IIRC).

    mispredict*penalty+count+loop
    0.06*17+1+1+    (=3.02)
    (probability)*(first check+mispredict*penalty+count+loop)
    (0.19)*(1+0.20*17+1+1)+  (= 0.19*6.4=1.22)
    (probability)*(first check+second check+count+loop)
    (0.75)*(1+1+1+1) (=3)
    = 7.24 cycles per iteration
    

    and the same for "sorted"

    0.25*17+1+1+ (=6.25)
    (1-0.75)*(1+0.24*17+1+1)+ (=.25*7.08=1.77)
    (1-0.75-0.19)*(1+1+1+1)  (= 0.06*4=0.24)
    = 8.26
    

    (8.26 - 7.24) / 8.26 ≈ 12.3% vs. ~9% measured (close to the measured!?!).
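
    The back-of-envelope model above is easy to put into code, so the inputs can be varied. This is a sketch that follows the formula term by term, including its simplifications; `Model`, `cycles_per_iteration`, `reverse_cycles`, `sorted_cycles` and `relative_gain` are names invented here, and the cycle costs are the assumed values from the text, not measurements.

```cpp
#include <cassert>
#include <cmath>

// Assumed costs, in cycles, taken from the estimate in the text.
struct Model {
    double penalty;  // branch mispredict penalty
    double check;    // one compare-and-branch
    double count;    // counter increment
    double loop;     // loop overhead
};

// Expected cycles per iteration.
//   p1_miss  : mispredict rate of the first branch
//   p_second : probability weight of the second case
//   p2_miss  : mispredict rate of the second branch
//   p_third  : probability weight of the remaining case
double cycles_per_iteration(const Model& m, double p1_miss, double p_second,
                            double p2_miss, double p_third) {
    double first  = p1_miss * m.penalty + m.count + m.loop;
    double second = p_second * (m.check + p2_miss * m.penalty + m.count + m.loop);
    double third  = p_third * (m.check + m.check + m.count + m.loop);
    return first + second + third;
}

// "Reverse" order (high, low, mid) with the numbers used above.
double reverse_cycles() {
    Model m{17.0, 1.0, 1.0, 1.0};
    return cycles_per_iteration(m, 0.06, 0.19, 0.20, 0.75);
}

// "Sorted" order (mid, low, high) with the numbers used above.
double sorted_cycles() {
    Model m{17.0, 1.0, 1.0, 1.0};
    return cycles_per_iteration(m, 0.25, 0.25, 0.24, 0.06);
}

// Fraction of the sorted time saved by the reverse order.
double relative_gain() {
    return (sorted_cycles() - reverse_cycles()) / sorted_cycles();
}
```

    Changing `penalty` in `Model` shows how sensitive the estimate is to the assumed mispredict cost, which is the least certain input here.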

    So what seemed obvious to the OP is not so obvious.

    These results apply to these tests only. Other tests, with more complicated code or more data dependencies, will certainly behave differently, so measure your own case.

    Changing the order of the tests changed the results, but that could also be due to different alignment of the loop start, which should ideally be 16-byte aligned on all newer Intel CPUs but isn't in this case.
