What is the effect of ordering if…else if statements by probability?

后端未结

关注

 10  1794

花落未央 2020-12-07 13:39

Specifically, if I have a series of if...else if statements, and I somehow know beforehand the relative probability that each statement will evalua

10条回答

攒了一身酷 (楼主)

2020-12-07 14:02

I decided to rerun the test on my own machine using Lik32 code. I had to change it due to my windows or compiler thinking high resolution is 1ms, using

mingw32-g++.exe -O3 -Wall -std=c++11 -fexceptions -g

vector rand_vec(10000000);

GCC has made the same transformation on both original codes.

Note that only the two first conditions are tested as the third must always be true, GCC is a kind of a Sherlock here.

Reverse

.L233:
        mov     DWORD PTR [rsp+104], 0
        mov     DWORD PTR [rsp+100], 0
        mov     DWORD PTR [rsp+96], 0
        call    std::chrono::_V2::system_clock::now()
        mov     rbp, rax
        mov     rax, QWORD PTR [rsp+8]
        jmp     .L219
.L293:
        mov     edx, DWORD PTR [rsp+104]
        add     edx, 1
        mov     DWORD PTR [rsp+104], edx
.L217:
        add     rax, 4
        cmp     r14, rax
        je      .L292
.L219:
        mov     edx, DWORD PTR [rax]
        cmp     edx, 94
        jg      .L293 // >= 95
        cmp     edx, 19
        jg      .L218 // >= 20
        mov     edx, DWORD PTR [rsp+96]
        add     rax, 4
        add     edx, 1 // < 20 Sherlock
        mov     DWORD PTR [rsp+96], edx
        cmp     r14, rax
        jne     .L219
.L292:
        call    std::chrono::_V2::system_clock::now()

.L218: // further down
        mov     edx, DWORD PTR [rsp+100]
        add     edx, 1
        mov     DWORD PTR [rsp+100], edx
        jmp     .L217

And sorted

        mov     DWORD PTR [rsp+104], 0
        mov     DWORD PTR [rsp+100], 0
        mov     DWORD PTR [rsp+96], 0
        call    std::chrono::_V2::system_clock::now()
        mov     rbp, rax
        mov     rax, QWORD PTR [rsp+8]
        jmp     .L226
.L296:
        mov     edx, DWORD PTR [rsp+100]
        add     edx, 1
        mov     DWORD PTR [rsp+100], edx
.L224:
        add     rax, 4
        cmp     r14, rax
        je      .L295
.L226:
        mov     edx, DWORD PTR [rax]
        lea     ecx, [rdx-20]
        cmp     ecx, 74
        jbe     .L296
        cmp     edx, 19
        jle     .L297
        mov     edx, DWORD PTR [rsp+104]
        add     rax, 4
        add     edx, 1
        mov     DWORD PTR [rsp+104], edx
        cmp     r14, rax
        jne     .L226
.L295:
        call    std::chrono::_V2::system_clock::now()

.L297: // further down
        mov     edx, DWORD PTR [rsp+96]
        add     edx, 1
        mov     DWORD PTR [rsp+96], edx
        jmp     .L224

So this doesn't tell us much except that the last case doesn't need a branch predict.

Now I tried all 6 combinations of the if's, the top 2 are the original reverse and sorted. high is >= 95, low is < 20, mid is 20-94 with 10000000 iterations each.

high, low, mid: 43000000ns
mid, low, high: 46000000ns
high, mid, low: 45000000ns
low, mid, high: 44000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 44000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 45000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 42000000ns
mid, low, high: 46000000ns
high, mid, low: 46000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 43000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 44000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 43000000ns
mid, low, high: 48000000ns
high, mid, low: 44000000ns
low, mid, high: 44000000ns
mid, high, low: 45000000ns
low, high, mid: 45000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 43000000ns
mid, low, high: 46000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 45000000ns
low, high, mid: 44000000ns

high, low, mid: 42000000ns
mid, low, high: 46000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 45000000ns
low, high, mid: 44000000ns

1900020, 7498968, 601012

Process returned 0 (0x0)   execution time : 2.899 s
Press any key to continue.

So why is the order high, low, med then faster (marginally)

Because the most unpredictable is last and therefore is never run through a branch predictor.

          if (i >= 95) ++nHigh;               // most predictable with 94% taken
          else if (i < 20) ++nLow; // (94-19)/94% taken ~80% taken
          else if (i >= 20 && i < 95) ++nMid; // never taken as this is the remainder of the outfalls.

So the branches will be predicted taken, taken and remainder with

6%+(0.94*)20% mispredicts.

"Sorted"

          if (i >= 20 && i < 95) ++nMid;  // 75% not taken
          else if (i < 20) ++nLow;        // 19/25 76% not taken
          else if (i >= 95) ++nHigh;      //Least likely branch

The branches will be predicted with not taken, not taken and Sherlock.

25%+(0.75*)24% mispredicts

Giving 18-23% difference (measured difference of ~9%) but we need to calculate cycles instead of mispredicting %.

Let's assume 17 cycles mispredict penalty on my Nehalem CPU and that each check takes 1 cycle to issue (4-5 instructions) and the loop takes one cycle too. The data dependencies are the counters and the loop variables, but once the mispredicts are out of the way it shouldn't influence the timing.

So for "reverse", we get the timings (this should be the formula used in Computer Architecture: A Quantitative Approach IIRC).

mispredict*penalty+count+loop
0.06*17+1+1+    (=3.02)
(propability)*(first check+mispredict*penalty+count+loop)
(0.19)*(1+0.20*17+1+1)+  (= 0.19*6.4=1.22)
(propability)*(first check+second check+count+loop)
(0.75)*(1+1+1+1) (=3)
= 7.24 cycles per iteration

and the same for "sorted"

0.25*17+1+1+ (=6.25)
(1-0.75)*(1+0.24*17+1+1)+ (=.25*7.08=1.77)
(1-0.75-0.19)*(1+1+1+1)  (= 0.06*4=0.24)
= 8.26

(8.26-7.24)/8.26 = 13.8% vs. ~9% measured (close to the measured!?!).

So the obvious of the OP is not obvious.

With these tests, other tests with more complicated code or more data dependencies will certainly be different so measure your case.

Changing the order of the test changed the results but that could be because of different alignments of the loop start which should ideally be 16 bytes aligned on all newer Intel CPUs but isn't in this case.

0 讨论(0)

查看其它10个回答