std::vector performance regression when enabling C++11


I have found an interesting performance regression in a small C++ snippet when I enable C++11:

#include <vector>
#include <cstddef>

struct Item
{
  int a;
  int b;
};

int main()
{
  const std::size_t num_items = 10000000;
  std::vector<Item> container;
  container.reserve(num_items);
  for (std::size_t i = 0; i < num_items; ++i)
    container.push_back(Item());
}
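The slowdown shows up with nothing but the -std flag changed, e.g. (gcc 4.7.2, the same compiler and measurement setup the answer below uses):

    $ g++ -std=c++98 -O3 regr.cpp && perf stat -r 10 ./a.out
    $ g++ -std=c++11 -O3 regr.cpp && perf stat -r 10 ./a.out
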
1 Answer

    I can reproduce your results on my machine with the options you gave in your post.

    However, if I also enable link-time optimization (passing the -flto flag to gcc 4.7.2 as well), the results are identical:

    (I am compiling your original code, with container.push_back(Item());)

    $ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 
    
     Performance counter stats for './a.out' (10 runs):
    
             35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
                     4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
                     0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
                19,801 page-faults               #    0.559 M/sec                  
            99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
            50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
            25,585,331 stalled-cycles-backend    #   25.84% backend  cycles idle     ( +-  4.90% ) [73.07%]
           141,947,224 instructions              #    1.43  insns per cycle        
                                                 #    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
            37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
                26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]
    
           0.035943226 seconds time elapsed                                          ( +-  1.79% )
    
    
    
    $ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 
    
     Performance counter stats for './a.out' (10 runs):
    
             35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
                     4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
                     0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
                19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
            98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
            50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
            26,270,699 stalled-cycles-backend    #   26.68% backend  cycles idle     ( +-  8.91% ) [74.43%]
           141,427,211 instructions              #    1.44  insns per cycle        
                                                 #    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
            37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
                26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]
    
           0.035953916 seconds time elapsed  
    

    As for the reasons, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function
    void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)
    fails with the default inline limit.
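
    One way to locate the out-of-line call is to demangle the symbols in the generated assembly, e.g. (assuming binutils' c++filt is available):

        $ g++ -std=c++11 -O3 -S regr.cpp
        $ c++filt < regr.s | grep emplace_back_aux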

    This failed inline has a domino effect: not because the function is actually called (it is not even called!), but because we have to be prepared. If it is called, the function arguments (Item.a and Item.b) must already be in the right place. This leads to pretty messy code.

    Here is the relevant part of the generated code for the case where inlining succeeds:

    .L42:
        testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
        je  .L3 #,
        movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
        movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
    .L3:
        addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
        subq    $1, %rbp    #, ivtmp.106
        je  .L41    #,
    .L14:
        cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
        jne .L42    #,
    

    This is a nice and compact for loop. Now, let's compare it to the failed-inline case:

    .L49:
        testq   %rax, %rax  # D.15772
        je  .L26    #,
        movq    16(%rsp), %rdx  # D.13379, D.13379
        movq    %rdx, (%rax)    # D.13379, *D.15772_60
    .L26:
        addq    $8, %rax    #, tmp75
        subq    $1, %rbx    #, ivtmp.117
        movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
        je  .L48    #,
    .L28:
        movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
        cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
        movl    $0, 16(%rsp)    #, D.13379.a
        movl    $0, 20(%rsp)    #, D.13379.b
        jne .L49    #,
        leaq    16(%rsp), %rsi  #,
        leaq    32(%rsp), %rdi  #,
        call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #
    

    This code is cluttered and there is a lot more going on in the loop than in the previous case. Before the function call (last line shown), the arguments must be placed appropriately:

    leaq    16(%rsp), %rsi  #,
    leaq    32(%rsp), %rdi  #,
    call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #
    

    Even though the call is never actually executed, the loop still arranges things beforehand:

    movl    $0, 16(%rsp)    #, D.13379.a
    movl    $0, 20(%rsp)    #, D.13379.b
    

    This is what leads to the messy code. If inlining succeeds and there is no function call, we have only 2 move instructions in the loop and no fiddling with %rsp (the stack pointer). However, if inlining fails, we get 6 moves and a lot of traffic through %rsp.
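
    To see the shape the compiler is left with, here is a self-contained sketch of the failed-inline situation. This is not the actual libstdc++ code: slow_path is a made-up stand-in for _M_emplace_back_aux, and __attribute__((noinline)) (GCC-specific) forces the out-of-line call by hand:

        #include <cstddef>
        #include <utility>
        #include <vector>

        struct Item { int a; int b; };

        // Made-up stand-in for the non-inlined _M_emplace_back_aux;
        // name and signature are illustrative only.
        __attribute__((noinline))
        void slow_path(std::vector<Item>& v, Item&& it)
        {
            v.push_back(std::move(it));
        }

        int main()
        {
            std::vector<Item> v;
            v.reserve(10000000);
            for (std::size_t i = 0; i < 10000000; ++i) {
                Item tmp{0, 0};  // must live in memory: slow_path takes Item&&
                if (v.size() == v.capacity())
                    slow_path(v, std::move(tmp));  // never taken after reserve()
                else
                    v.push_back(tmp);              // the hot path
            }
        }

    Because the rarely-taken branch binds tmp to an rvalue reference, tmp needs an address, so the compiler cannot simply sink the stores into that branch; they happen on every iteration, just like the two movl instructions to 16(%rsp) above.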

    Just to substantiate my theory, here are two runs, both in C++11 mode, that differ only in the -finline-limit:

     $ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out
    
     Performance counter stats for './a.out' (10 runs):
    
             84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
                     8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
                     1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
                19,801 page-faults               #    0.234 M/sec                  
           266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
           206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
           129,078,683 stalled-cycles-backend    #   48.38% backend  cycles idle     ( +-  1.37% ) [69.49%]
           183,130,306 instructions              #    0.69  insns per cycle        
                                                 #    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
            38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
                24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]
    
           0.085359326 seconds time elapsed                                          ( +-  1.31% )
    
     $ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out
    
     Performance counter stats for './a.out' (10 runs):
    
             37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
                     4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
                     0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
                19,801 page-faults               #    0.524 M/sec                  
           104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
            58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
            30,572,036 stalled-cycles-backend    #   29.20% backend  cycles idle     ( +-  5.31% ) [71.40%]
           140,669,773 instructions              #    1.34  insns per cycle        
                                                 #    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
            38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
                27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]
    
           0.038187580 seconds time elapsed                                          ( +-  2.05% )
    

    Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.


    So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link-time optimization. It gave a significant performance boost to my programs (up to 2.5x), and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)
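
    For reference, when compiling in separate steps, -flto has to be passed both when compiling and when linking (a minimal sketch, assuming GCC):

        $ g++ -std=c++11 -O3 -flto -c regr.cpp -o regr.o
        $ g++ -std=c++11 -O3 -flto regr.o -o regr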

    However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as whitespace anyway.)


    Great question, +1!
