Effective optimization strategies on modern C++ compilers


I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to …

19 answers
  • 2020-12-22 17:29

    How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

    I can't speak for all compilers, but my experience with GCC shows that it will not heavily optimize code with respect to the cache, and I would expect this to be true of most modern compilers. Optimizations such as reordering nested loops can definitely affect performance, so if you believe you have memory access patterns that could lead to many cache misses, it is in your interest to investigate this.
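
    For illustration, here is a minimal sketch of that kind of reordering (the array names and sizes are made up). With row-major arrays, making the row index the outer loop lets the inner loop walk contiguous memory instead of touching a new cache line on every iteration:

    #include <cstddef>

    constexpr std::size_t N = 1024;
    static double a[N][N], b[N][N];

    // Cache-unfriendly: the inner loop strides down a column,
    // hitting a new cache line each iteration (row-major layout).
    void copy_column_order() {
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < N; ++i)
                a[i][j] = b[i][j];
    }

    // Cache-friendly: same work, loops interchanged so the inner
    // loop walks contiguous memory within a row.
    void copy_row_order() {
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                a[i][j] = b[i][j];
    }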

  • 2020-12-22 17:34

    Take a look at the excellent Pitfalls of Object-Oriented Programming slides for some info about restructuring code for locality. In my experience getting better locality is almost always the biggest win.

    General process:

    • Learn to love the Disassembly View in your debugger, or have your build system generate the intermediate assembly files (.s) if at all possible. Keep an eye out for changes, or for things that look egregious -- even without familiarity with a given instruction set architecture, you should be able to see some things fairly clearly! (I sometimes check in a series of .s files with the corresponding .cpp/.c changes, just to leverage the lovely tools from my SCM to watch the code and corresponding asm change over time.)
    • Get a profiler that can watch your CPU's performance counters, or that can at least estimate cache misses (AMD CodeAnalyst, cachegrind, VTune, etc.).

    Some other specific things:

    • Understand strict aliasing. Once you do, make use of restrict if your compiler has it; the second sketch after this list shows it in use. (Examine the disasm here too!)
    • Check out the different floating point modes on your processor and compiler. If you don't need the denormalized range, choosing a mode without it can result in better performance; see the FTZ/DAZ sketch after this list. (It sounds like you've already done some things in this area, based on your discussion of rounding modes.)
    • Definitely avoid allocs: call reserve on std::vector when you can, or use std::array when you know the size at compile-time.
    • Use memory pools to increase locality and decrease alloc/free overhead; also to ensure cacheline alignment and prevent ping-ponging.
    • Use frame allocators if you're allocating things in predictable patterns, and can afford to deallocate everything in one go.
    • Do be aware of invariants. Something you know is invariant may not be invariant to the compiler, for example a use of a struct or class member in a loop. I find the single easiest way to fall into the correct habit here is to give a name to everything, and prefer to name things outside of loops, e.g. const int threshold = m_currentThreshold; or perhaps Thing * const pThing = pStructHoldingThing->pThing; (see the second sketch after this list). Fortunately you can usually spot the places that need this treatment in the disassembly view. This also helps with debugging later (it makes the watch/locals window behave much more nicely in debug builds)!
    • Avoid writes in loops if possible -- accumulate first, then write, or batch a few writes together. YMMV, of course.
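
    Regarding the floating point modes above: on x86 with SSE, one common way to opt out of denormal handling is the FTZ/DAZ bits in MXCSR. A minimal sketch follows; whether this is acceptable depends entirely on your numerics.

    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)
    #include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

    // Flush denormal results to zero (FTZ) and treat denormal inputs
    // as zero (DAZ). Only do this if your algorithm can tolerate
    // losing the denormalized range. Affects the calling thread.
    void disable_denormals() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }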
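
    And a sketch of the naming/hoisting habit, combined with restrict (the struct and function here are hypothetical). Without the hoisted copy, the compiler may have to assume a write through out could change p->gain, and reload it on every iteration:

    #include <cstddef>

    struct Params { float gain; };

    // __restrict (the GCC/Clang/MSVC spelling) promises the pointers
    // don't alias; the named local makes the loop invariant explicit.
    void scale(float* __restrict out, const float* __restrict in,
               const Params* p, std::size_t n) {
        const float gain = p->gain;  // hoisted, named invariant
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i] * gain;
    }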

    WRT your std::priority_queue question: inserting things into a vector (the default underlying container for a priority_queue) tends to move a lot of elements around. If you can break the work up into phases, where you insert data, then sort it, then read it once it's sorted, you'll probably be a lot better off; a sketch follows below. Although you'll definitely lose locality, you may find a more self-ordering structure like a std::map or std::set worth the overhead, but this is really dependent on your usage patterns.
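
    A minimal sketch of that phase split (the names are hypothetical): append everything first, sort once, then consume in order, instead of paying heap maintenance on every insert.

    #include <algorithm>
    #include <vector>

    std::vector<double> items;

    void insert_phase(double v) { items.push_back(v); }  // cheap appends

    void read_phase() {
        std::sort(items.begin(), items.end());  // one sort, not N heap fixups
        for (double v : items) {
            // consume v in priority order
        }
    }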

  • 2020-12-22 17:34

    I'm surprised no one has mentioned these two:

    • Link-time optimization: clang and g++ from 4.5 on support link-time optimization. I've heard that in g++'s case the heuristics are still pretty immature, but they should improve quickly now that the main architecture is laid out.

      The benefits are interprocedural optimizations at the object-file level, including highly sought-after things like inlining of virtual calls (devirtualization).

    • Project inlining: this might seem to some like a very crude approach, but it is that very crudeness which makes it so powerful: it amounts to dumping all your headers and .cpp files into a single, really big .cpp file and compiling that; basically it will give you the same benefits as link-time optimization on your trip back to 1999. Of course, if your project is really big, you'll still need a 2010 machine; this thing will eat your RAM like there is no tomorrow. Even in that case, though, you can split it into more than one not-so-damn-huge .cpp file. A sketch of what this looks like follows.
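
      A rough sketch of such a unity build file (the file names are hypothetical): one translation unit includes all the others, so the optimizer sees every definition at once.

      // unity.cpp -- compile only this one translation unit.
      // The optimizer can now inline across what used to be
      // object-file boundaries.
      #include "simulation.cpp"
      #include "integrator.cpp"
      #include "io.cpp"
      // ... one #include per remaining .cpp; watch for clashing
      // file-local names (statics, anonymous namespaces) when merging.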

  • 2020-12-22 17:36

    If you work on big matrices, for instance, consider tiling your loops to improve locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses; a minimal tiling sketch follows.
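
    Here is a minimal tiling sketch (sizes are made up, and N is assumed to be a multiple of TILE). A transpose is the classic case: untiled, one of the two arrays is always accessed with a large stride, whereas tiles keep both arrays' working sets cache-resident while they're reused.

    #include <cstddef>

    constexpr std::size_t N = 1024;
    constexpr std::size_t TILE = 64;  // tune so a pair of tiles fits in L2
    static double src[N][N], dst[N][N];

    void transpose_tiled() {
        for (std::size_t ii = 0; ii < N; ii += TILE)
            for (std::size_t jj = 0; jj < N; jj += TILE)
                // finish a TILE x TILE block of both arrays before
                // moving on, instead of striding N doubles at a time
                for (std::size_t i = ii; i < ii + TILE; ++i)
                    for (std::size_t j = jj; j < jj + TILE; ++j)
                        dst[j][i] = src[i][j];
    }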

  • 2020-12-22 17:38

    Here is a nice paper on the subject.

  • 2020-12-22 17:38

    Here is something that worked for me once; I can't say that it will work for you. I had code along the lines of

    switch(num) {
       case 1: result = f1(param); break;
       case 2: result = f2(param); break;
       //...
    }
    

    Then I got a serious performance boost when I changed it to

    // init (a function-pointer table; Fn stands in for the actual
    // shared signature of f1, f2, ...):
    using Fn = Result (*)(Param);
    Fn funcs[] = {f1, f2 /*...*/};
    // later in the code (num is 1-based, matching the switch cases):
    result = funcs[num - 1](param);
    

    Perhaps someone here can explain the reason the latter version is better. I suppose it has something to do with the fact that there are no conditional branches there.
