Effective optimization strategies on modern C++ compilers


I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to …

19 answers
  • 2020-12-22 17:29

    How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

    I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache, and I would expect this to be true of most modern compilers. Optimizations such as reordering nested loops can definitely affect performance, so if you believe you have memory access patterns that could lead to many cache misses, it is in your interest to investigate this.
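
    For illustration, here is a minimal sketch of that kind of reordering (the array names and sizes are made up). With row-major arrays, making the row index the outer loop lets the inner loop walk contiguous memory instead of touching a new cache line on every iteration:

    #include <cstddef>

    constexpr std::size_t N = 1024;
    static double a[N][N], b[N][N];

    // Cache-unfriendly: the inner loop strides down a column,
    // hitting a new cache line each iteration (row-major layout).
    void copy_column_order() {
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                a[i][j] = b[i][j];
    }

    // Cache-friendly: same work, loops interchanged so the inner
    // loop walks contiguous memory within a row.
    void copy_row_order() {
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                a[i][j] = b[i][j];
    }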

  • 2020-12-22 17:34

    Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.

    General process:

    • Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye out for changes, or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with the corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
    • Get a profiler that can watch your CPU's performance counters, or that can at least estimate cache misses (AMD CodeAnalyst, cachegrind, VTune, etc.).

    Some other specific things:

    • Understand strict aliasing. Once you do, make use of restrict if your compiler has it; the second sketch after this list shows it in use. (Examine the disasm here too!)
    • Check out the different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without it can result in better performance; see the FTZ/DAZ sketch after this list. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
    • Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
    • Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
    • Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
    • Do be aware of invariants. Something you know is invariant may not be invariant to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops, e.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; (see the second sketch after this list). Fortunately you can usually spot the places that need this treatment in the disassembly view. This also helps with debugging later (it makes the watch/locals window behave much more nicely in debug builds)!
    • Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
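
    Regarding the floating point modes above: on x86 with SSE, one common way to opt out of denormal handling is the FTZ/DAZ bits in MXCSR. A minimal sketch follows; whether this is acceptable depends entirely on your numerics.

    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)
    #include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

    // Flush denormal results to zero (FTZ) and treat denormal inputs
    // as zero (DAZ). Only do this if your algorithm can tolerate
    // losing the denormalized range. Affects the calling thread.
    void disable_denormals() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }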
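
    And a sketch of the naming/hoisting habit, combined with restrict (the struct and function here are hypothetical). Without the hoisted copy, the compiler may have to assume a write through out could change p->gain, and reload it on every iteration:

    #include <cstddef>

    struct Params { float gain; };

    // __restrict (the GCC/Clang/MSVC spelling) promises the pointers
    // don't alias; the named local makes the loop invariant explicit.
    void scale(float* __restrict out, const float* __restrict in,
               const Params* p, std::size_t n) {
        const float gain = p->gain;  // hoisted, named invariant
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i] * gain;
    }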

    WRT your std::priority_queue question: inserting things into a vector (the default underlying container for a priority_queue) tends to move a lot of elements around. If you can break the work up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off; a sketch follows below. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead, but this is really dependent on your usage patterns.
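
    A minimal sketch of that phase split (the names are hypothetical): append everything first, sort once, then consume in order, instead of paying heap maintenance on every insert.

    #include <algorithm>
    #include <vector>

    std::vector<double> items;

    void insert_phase(double v) { items.push_back(v); }  // cheap appends

    void read_phase() {
        std::sort(items.begin(), items.end());  // one sort, not N heap fixups
        for (double v : items) {
            // consume v in priority order
        }
    }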

  • 2020-12-22 17:34

    I'm surprised no one has mentioned these two:

    • Link-time optimization: clang and g++ from 4.5 on support link-time optimization. I've heard that in g++'s case the heuristics are still pretty immature, but they should improve quickly now that the main architecture is laid out.

      The benefits are interprocedural optimizations at the object-file level, including highly sought-after things like inlining of virtual calls (devirtualization).

    • Project inlining: this might seem to some like a very crude approach, but it is that very crudeness which makes it so powerful: it amounts to dumping all your headers and .cpp files into a single, really big .cpp file and compiling that; basically it will give you the same benefits as link-time optimization on your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. Even in that case, though, you can split it into more than one not-so-damn-huge .cpp file. A sketch of what this looks like follows.
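
      A rough sketch of such a unity build file (the file names are hypothetical): one translation unit includes all the others, so the optimizer sees every definition at once.

      // unity.cpp -- compile only this one translation unit.
      // The optimizer can now inline across what used to be
      // object-file boundaries.
      #include "simulation.cpp"
      #include "integrator.cpp"
      #include "io.cpp"
      // ... one #include per remaining .cpp; watch for clashing
      // file-local names (statics, anonymous namespaces) when merging.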

  • 2020-12-22 17:36

    If you work on big matrices, for instance, consider tiling your loops to improve locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses; a minimal tiling sketch follows.
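
    Here is a minimal tiling sketch (sizes are made up, and N is assumed to be a multiple of TILE). A transpose is the classic case: untiled, one of the two arrays is always accessed with a large stride, whereas tiles keep both arrays' working sets cache-resident while they're reused.

    #include <cstddef>

    constexpr std::size_t N = 1024;
    constexpr std::size_t TILE = 64;  // tune so a pair of tiles fits in L2
    static double src[N][N], dst[N][N];

    void transpose_tiled() {
        for (std::size_t ii = 0; ii < N; ii += TILE)
            for (std::size_t jj = 0; jj < N; jj += TILE)
                // finish a TILE x TILE block of both arrays before
                // moving on, instead of striding N doubles at a time
                for (std::size_t i = ii; i < ii + TILE; ++i)
                    for (std::size_t j = jj; j < jj + TILE; ++j)
                        dst[j][i] = src[i][j];
    }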

  • 2020-12-22 17:38

    Here is a nice paper on the subject.

  • 2020-12-22 17:38

    Here is something that worked for me once; I can't say that it will work for you. I had code along the lines of

    switch(num) {
       case 1: result = f1(param); break;
       case 2: result = f2(param); break;
       //...
    }
    

    Then I got a serious performance boost when I changed it to

    // init (a function-pointer table; Fn stands in for the actual
    // shared signature of f1, f2, ...):
    using Fn = Result (*)(Param);
    Fn funcs[] = {f1, f2 /*...*/};
    // later in the code (num is 1-based, matching the switch cases):
    result = funcs[num - 1](param);
    

    Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.
