Using bools in calculations to avoid branches

后端 未结 2 2031
萌比男神i
萌比男神i 2021-01-02 10:43

Here\'s a little micro-optimization curiosity that I came up with:

struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mS         


        
相关标签:
2条回答
  • 2021-01-02 11:29

    Does the second version avoid a branch

    if you compile your code to get assembler output, g++ -o test.s test.cpp -S, you'll find that a branch is indeed avoided in the second function.

    and consequently, is faster than the first version

    i ran each of your functions 2147483647 or INT_MAX number of times where in each iteration i randomly assigned a boolean value to running member of your Timer struct, using this code:

    int main() {
        const int max = std::numeric_limits<int>::max();
        timestamp_t start, end, one, two;
        Timer t_one, t_two;
        double percent;
    
        srand(time(NULL));
    
        start = get_timestamp();
        for(int i = 0; i < max; ++i) {
            t_one.running = rand() % 2;
            t_one.step_versionOne(1);
        }
        end = get_timestamp();
        one = end - start;
    
        std::cout << "step_versionOne      = " << one << std::endl;
    
        start = get_timestamp();
        for(int i = 0; i < max; ++i) {
            t_two.running = rand() % 2;
            t_two.step_versionTwo(1);
        }
        end = get_timestamp();
        two = end - start;
    
        percent = (one - two) / static_cast<double>(one) * 100.0;
    
        std::cout << "step_versionTwo      = " << two << std::endl;
        std::cout << "step_one - step_two  = " << one - two << std::endl;
        std::cout << "one fast than two by = " << percent << std::endl;
     }
    

    and these are the results i got:

    step_versionOne      = 39738380
    step_versionTwo      = 26047337
    step_one - step_two  = 13691043
    one fast than two by = 34.4529%
    

    so yes, the second function is clearly faster, and by around 35%. note that the percentage increase in timed performance varied between 30 and 55 percent for a smaller number of iterations, whereas it seems to plateau at around 35% the longer it runs. this is might be due to sporadic execution of system tasks while the simulation is running, which become a lot less sporadic, i.e. consistent the longer you run the sim (although this is just my assumption, i have no idea if it's actually true)

    all in all, nice question, i learned something today!


    MORE:


    of course, by randomly generating running, we are essentially rendering branch prediction useless in the first function, so the results above are not too surprising. however, if we decide to not alter running during loop iterations and instead leave it at its default value, in this case false, branch prediction will do its magic in the first function, and will actually be faster by almost 20% as these results suggest:

    step_versionOne      = 6273942
    step_versionTwo      = 7809508
    step_two - step_one  = 1535566
    two fast than one by = 19.6628
    

    because running is constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running - result of a compiler optimization likely.

    why is the second function slower in this case? well, branch prediction will quickly realize that the condition in the first function is never met, and so will stop checking in the first place (as though if(running) ticks += mStepSize; isn't even there). on the other hand, the second function will still have to perform this instruction ticks += mStepSize * static_cast<int>(running); in every iteration, thus making the first function more efficient.

    but what if we set running to true? well, the branch prediction will kick in again, however, this time, the first function will have to evaluate ticks += mStepSize; in every iteration; here the results when running{true}:

    step_versionOne      = 7522095
    step_versionTwo      = 7891948
    step_two - step_one  = 369853
    two fast than one by = 4.68646
    

    notice that step_versionTwo takes a consistent amount of time whether running is constantly true or false. but it still takes longer than step_versionTwo, however marginally. well, this might be because i was too lazy to run it a lot of times to determine whether it's consistently faster or whether it was a one time fluke (results vary slightly every time you run it, since the OS has to run in the background and it's not always going to do the same thing). but if it is consistently faster, it might be because function two (ticks += mStepSize * static_cast<int>(running);) has an arithmetic op more than function one (ticks += mStepSize;).

    finally, let's compile with an optimization - g++ -o test test.cpp -std=c++11 -O1 and let's revert running back to false and then check the results:

    step_versionOne      = 704973
    step_versionTwo      = 695052
    

    more or less the same. the compiler will do its optimization pass, and realize running is always false and will thus, for all intents and purposes, remove the body of step_versionOne, so when you call it from the loop in main, it'll just call the function and return.

    on the other hand, when optimizing the second function, it will realize that ticks += mStepSize * static_cast<int>(running); will always generate the same result, i.e. 0, so it won't bother executing that either.

    all in all, if i'm correct (and if not, please correct me, i'm pretty new to this), all you'll get when calling both functions from the main loop is their overhead.

    p.s. here's the result for the first case (running is randomly generated in every iteration) when compiled with an optimization.

    step_versionOne      = 18868782
    step_versionTwo      = 18812315
    step_two - step_one  = 56467
    one fast than two by = 0.299261
    
    0 讨论(0)
  • Yes, your trick allows to avoid branch and it makes it faster... sometimes.

    I wrote benchmark that compares these solutions in various situations, along with my own:

    ticks += mStepSize & -static_cast<int>(running)
    

    My results are following:

    Off:
     branch: 399949150
     mul:    399940271
     andneg: 277546678
    On:
     branch: 204035423
     mul:    399937142
     andneg: 277581853
    Pattern:
     branch: 327724860
     mul:    400010363
     andneg: 277551446
    Random:
     branch: 915235440
     mul:    399916440
     andneg: 277537411
    

    Off is when timers are turned off. In this cases solutions take about the same time.

    On is when they are turned on. Branching solution two times faster.

    Pattern is when they are in 100110 pattern. Performance is similar, but branching is a bit faster.

    Random is when branch is unpredictable. In this case multiplications is more than 2 times faster.

    In all cases my bit-hacking trick is fastest, except for On where branching wins.

    Note that this benchmark is not necessarily representative for all compiler versions processors etc. Even small changes of benchmark can turn results upside down (for example if compiler can inline knowing mStepSize is 1 than multiplication can be actually fastest).

    Code of the benchmark:

    #include<array>
    #include<iostream>
    #include<chrono>
    
    struct Timer {
        bool running{false};
        int ticks{0};
    
        void branch(int mStepSize) {
            if(running) ticks += mStepSize;
        }
    
        void mul(int mStepSize) {
            ticks += mStepSize * static_cast<int>(running);
        }
    
        void andneg(int mStepSize) {
            ticks += mStepSize & -static_cast<int>(running);
        }
    };
    
    void run(std::array<Timer, 256>& timers, int step) {
        auto start = std::chrono::steady_clock::now();
        for(int i = 0; i < 1000000; i++)
            for(auto& t : timers)
                t.branch(step);
        auto end = std::chrono::steady_clock::now();
        std::cout << "branch: " << (end - start).count() << std::endl;
        start = std::chrono::steady_clock::now();
        for(int i = 0; i < 1000000; i++)
            for(auto& t : timers)
                t.mul(step);
        end = std::chrono::steady_clock::now();
        std::cout << "mul:    " << (end - start).count() << std::endl;
        start = std::chrono::steady_clock::now();
        for(int i = 0; i < 1000000; i++)
            for(auto& t : timers)
                t.andneg(step);
        end = std::chrono::steady_clock::now();
        std::cout << "andneg: " << (end - start).count() << std::endl;
    }
    
    int main() {
        std::array<Timer, 256> timers;
        int step = rand() % 256;
    
        run(timers, step); // warm up
        std::cout << "Off:\n";
        run(timers, step);
        for(auto& t : timers)
            t.running = true;
        std::cout << "On:\n";
        run(timers, step);
        std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
        for(int i = 0; i < 256; i++)
            timers[i].running = pattern[i % 6];
        std::cout << "Pattern:\n";
        run(timers, step);
        for(auto& t : timers)
            t.running = rand()&1;
        std::cout << "Random:\n";
        run(timers, step);
        for(auto& t : timers)
            std::cout << t.ticks << ' ';
        return 0;
    }
    
    0 讨论(0)
提交回复
热议问题