Effective optimization strategies on modern C++ compilers

梦如初夏 2020-12-22 17:02

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to optimize.

19 Answers
  • 2020-12-22 17:23

    Here is some stuff I have used:

    • templates to specialize innermost loop bounds (makes them really fast)
    • __restrict__ keywords to work around aliasing problems
    • reserving vector capacity beforehand to sane defaults
    • avoiding map (it can be really slow)
    • vector append/insert can be significantly slow; if that is the case, raw operations may be faster
    • N-byte memory alignment (Intel has pragma aligned, http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm); see the alignment sketch after this list
    • trying to keep memory within L1/L2 caches
    • compiling with NDEBUG
    • profiling with oprofile, using opannotate to look at specific lines (STL overhead is clearly visible then)
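
    As a hedged sketch of the alignment bullet: standard C++ can also express N-byte alignment portably, without the Intel pragma. The names Vec8f and alloc_aligned are mine, for illustration only.

    #include <cstdlib>   // std::aligned_alloc (C++17; MSVC instead offers _aligned_malloc)

    // 32-byte alignment on a type, enough for one AVX register:
    struct alignas(32) Vec8f {
        float v[8];
    };

    // Heap buffer aligned for 16-byte SSE loads (_mm_load_ps requires it).
    // Note: std::aligned_alloc requires the size to be a multiple of the alignment.
    float *alloc_aligned(std::size_t n) {
        return static_cast<float *>(std::aligned_alloc(16, n * sizeof(float)));
    }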

    here are sample parts of profile data (so you know where to look for problems)

     * Output annotated source file with samples
     * Output all files
     *
     * CPU: Core 2, speed 1995 MHz (estimated)
    --
     * Total samples for file : "/home/andrey/gamess/source/blas.f"
     *
     * 1020586 14.0896
    --
     * Total samples for file : "/home/andrey/libqc/rysq/src/fock.cpp"
     *
     * 962558 13.2885
    --
     * Total samples for file : "/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp"
     *
     * 748150 10.3285
    --
     * Total samples for file : "/usr/include/boost/numeric/ublas/functional.hpp"
     *
     * 639714  8.8315
    --
     * Total samples for file : "/home/andrey/gamess/source/eigen.f"
     *
     * 429129  5.9243
    --
     * Total samples for file : "/usr/include/c++/4.3/bits/stl_algobase.h"
     *
     * 411725  5.6840
    --
    

    Example code from my project:

    // Loop bounds ni/nj/nk/nl are template parameters, so each specialization
    // has compile-time-constant trip counts that the compiler can fully unroll.
    template<int ni, int nj, int nk, int nl>
    inline void eval(const Data::density_type &D, const Data::fock_type &F,
                     const double *__restrict Q, double scale) {

        const double * __restrict Dij = D[0];
        ...
        double * __restrict Fij = F[0];
        ...

        // jk, jl, ik, il are declared in the elided code above.
        for (int l = 0, kl = 0, ijkl = 0; l < nl; ++l) {
            for (int k = 0; k < nk; ++k, ++kl) {
                for (int j = 0, ij = 0; j < nj; ++j, ++jk, ++jl) {
                    for (int i = 0; i < ni; ++i, ++ij, ++ik, ++il, ++ijkl) {
                        ...
                    }
                }
            }
        }
    }
    
  • 2020-12-22 17:25

    One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?

    On some compilers this is incorrect. With whole-program optimization (also called link-time optimization, LTO), the compiler has perfect knowledge of all code across all translation units (including static libraries) and can optimize it just as it would if it were in a single translation unit. A few compilers that support this feature come to mind (a manual-hoisting sketch follows the list):

    • Microsoft Visual C++ compilers
    • Intel C++ Compiler
    • LLVM-GCC
    • GCC (via -flto in recent versions)
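
    When you can't rely on LTO being enabled, the manual hoisting the question describes is cheap insurance. A minimal sketch (the function upcase is mine, for illustration):

    #include <cstring>
    #include <cctype>

    // Compute strlen() once instead of in the loop condition. Without
    // LTO/whole-program optimization the compiler may be unable to prove
    // that the loop body leaves the string's length unchanged.
    void upcase(char *s) {
        const std::size_t n = std::strlen(s);   // hoisted out of the condition
        for (std::size_t i = 0; i < n; ++i)
            s[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[i])));
    }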
  • 2020-12-22 17:25

    My current project is a media server with multithreaded processing (C++). It's a time-critical application: slow functions can cause bad results in media streaming such as loss of sync, high latency, and huge delays.

    The strategy I usually use to guarantee the best possible performance is to minimize the number of heavy operating-system calls that allocate or manage resources such as memory, files, and sockets.

    At first I wrote my own STL, network, and file-management classes.

    All my container classes ("MySTL") manage their own memory blocks to avoid repeated alloc (new)/free (delete) calls. Released objects are enqueued in a memory-block pool to be reused when needed. This improves performance and protects my code against memory fragmentation.
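
    A minimal sketch of that kind of block pool (the BlockPool name and interface are mine, not the actual "MySTL" code):

    #include <cstddef>
    #include <vector>

    // Fixed-size block pool: freed blocks go on a free list and are reused,
    // so steady-state operation performs no new/delete calls. All blocks
    // must be released back to the pool before it is destroyed.
    class BlockPool {
    public:
        explicit BlockPool(std::size_t block_size) : block_size_(block_size) {}

        void *acquire() {
            if (free_list_.empty())
                return ::operator new(block_size_);   // grow only when empty
            void *p = free_list_.back();
            free_list_.pop_back();
            return p;
        }

        void release(void *p) { free_list_.push_back(p); }  // recycle, don't free

        ~BlockPool() {
            for (void *p : free_list_) ::operator delete(p);
        }

    private:
        std::size_t block_size_;
        std::vector<void *> free_list_;
    };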

    The parts of the code that need access to slower system resources (files, databases, scripts, network writes) run on separate threads. But not one thread per unit (e.g., not one thread per socket); otherwise the operating system would lose performance managing a high number of threads. Instead, group objects of the same class to be processed on one separate thread where possible.

    For example, if you have to write data to a network socket but the socket's write buffer is full, I save the data in a send-queue buffer (shared among all sockets) to be sent on a separate thread as soon as the socket becomes writable again. This way the main threads never stop processing in a blocked state, waiting for the operating system to free a specific resource. All released buffers are saved and reused when needed.

    After all that, a profiling tool is welcome for finding program bottlenecks and showing which algorithms should be improved.

    Using this strategy I have succeeded in keeping servers running for 500+ days on a Linux machine without rebooting, with thousands of users logging in every day.

    [02:01] -alpha.ip.tv- Uptime: 525days 12hrs 43mins 7secs

  • 2020-12-22 17:26

    If you are doing heavy floating-point math, you should consider using SSE to vectorize your computations, if that maps well to your problem.

    Google "SSE intrinsics" for more information about this.
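    A hedged illustration (the name add_arrays is mine; it assumes 16-byte-aligned pointers and n divisible by 4):

    #include <immintrin.h>   // SSE intrinsics
    #include <cstddef>

    // Add two float arrays four lanes at a time. Production code also needs
    // a scalar tail loop and alignment handling; _mm_load_ps requires
    // 16-byte-aligned pointers.
    void add_arrays(const float *a, const float *b, float *out, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }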

  • 2020-12-22 17:27

    Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?

    The STL is generally the fastest for the general case. If you have a very specific case, you might see a speed-up with a hand-rolled version. For example, std::sort (typically introsort, a quicksort variant) is the fastest general-purpose sort, but if you know in advance that your elements are virtually already ordered, then insertion sort might be a better choice, as sketched below.
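
    A minimal sketch of that special case; profile it against std::sort on your real data before switching:

    #include <cstddef>
    #include <vector>

    // Insertion sort: O(n^2) worst case, but close to O(n) on nearly-sorted
    // input, where it can beat a general-purpose sort.
    void insertion_sort(std::vector<double> &v) {
        for (std::size_t i = 1; i < v.size(); ++i) {
            double key = v[i];
            std::size_t j = i;
            while (j > 0 && v[j - 1] > key) {
                v[j] = v[j - 1];   // shift larger elements right
                --j;
            }
            v[j] = key;
        }
    }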

    Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?

    This depends on where you do the static allocation. One thing I tried along this line was to statically allocate a large amount of memory on the stack, then reuse it later. Result? Heap memory was substantially faster. Just because an item is on the stack doesn't make it faster to access; the speed of stack memory also depends on things like cache. A statically allocated global array may not be any faster than the heap. I assume you have already tried techniques like just reserving the upper bound. If you have a lot of vectors that have the same upper bound, consider improving cache locality with a vector of structs containing the data members.

    I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?

    I personally pass the result in by reference in this scenario; it allows a lot more reuse. Passing large data structures by value and hoping that the compiler uses RVO is not a good idea when you can just perform the "manual RVO" yourself, as sketched below.
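
    A minimal sketch of that out-parameter style (the name compute_forces and the placeholder arithmetic are mine, for illustration):

    #include <cstddef>
    #include <vector>

    // "Manual RVO": take the result as an out-parameter instead of returning
    // a large vector by value and hoping the copy is elided. The caller can
    // reuse the same buffer across many calls, avoiding reallocation.
    void compute_forces(const std::vector<double> &positions,
                        std::vector<double> &forces /* out */) {
        forces.resize(positions.size());
        for (std::size_t i = 0; i < positions.size(); ++i)
            forces[i] = -2.0 * positions[i];   // placeholder arithmetic
    }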

    How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

    I found that they weren't particularly cache-aware. The issue is that the compiler doesn't understand your program and can't predict the vast majority of its state, especially if you depend heavily on the heap. If you have a profiler that ships with your compiler, for example Visual Studio's Profile Guided Optimization, this can produce excellent speedups.

    Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?

    There are different floating-point models - Visual Studio gives the /fp:fast compiler setting (GCC's rough counterpart is -ffast-math). As for the exact effects of doing so, I can't be certain. However, you could try altering the floating-point precision or other settings in your compiler and checking the result.

    One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?

    I've never come across such a scenario. However, if you're genuinely concerned, the option remains to do it manually. One thing you could try is passing the value through a const reference, suggesting to the compiler that it won't change.

    One of the other things I want to point out is the use of non-standard compiler extensions, for example Visual Studio's __assume (a sketch follows). http://msdn.microsoft.com/en-us/library/1b3fsfxw(VS.80).aspx
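
    A minimal sketch of __assume (the function name classify is mine; __assume itself is the documented MSVC intrinsic):

    #ifdef _MSC_VER
    // __assume() feeds a fact to the optimizer. The classic use is marking
    // an unreachable default branch so the switch's range check can be
    // dropped. If the assumption is ever false, behavior is undefined.
    int classify(int kind) {
        switch (kind) {
            case 0: return 10;
            case 1: return 20;
            case 2: return 30;
            default: __assume(0);   // tells MSVC this path is unreachable
        }
    }
    #endif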

    There's also multithreading, though I expect you've already gone down that road. You could try some specific optimizations, like the SSE another answer suggested.

    Edit: I realize a lot of the suggestions I posted reference Visual Studio directly. That's true, but GCC almost certainly provides alternatives for the majority of them; I just have the most personal experience with VS.

  • 2020-12-22 17:29

    Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?

    I assume you're aware that the STL containers rely on copying the elements. In certain cases, this can be a significant loss. Store pointers and you may see an increase in performance if you do a lot of container manipulation. On the other hand, it may reduce cache locality and hurt you. Another option is to use specialized allocators.

    Certain containers (e.g. map, set, list) rely on lots of pointer manipulation. Although it is counterintuitive, replacing them with vector can often lead to faster code. The resulting algorithm might go from O(1) or O(log n) to O(n), but due to cache locality it can be much faster in practice. Profile to be sure; a sorted-vector lookup sketch follows.
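
    A hedged sketch of the sorted-vector-instead-of-map idea (find_in_flat_map is my name for it):

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Sorted vector as a map replacement: lookup is O(log n) via binary
    // search but touches contiguous memory, which often beats std::map's
    // pointer chasing in practice. The vector must be kept sorted by key.
    template <class K, class V>
    const V *find_in_flat_map(const std::vector<std::pair<K, V>> &v, const K &key) {
        auto it = std::lower_bound(v.begin(), v.end(), key,
            [](const std::pair<K, V> &p, const K &k) { return p.first < k; });
        return (it != v.end() && it->first == key) ? &it->second : nullptr;
    }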

    You mentioned you're using priority_queue, which I would imagine pays a lot for rearranging the elements, especially if they're large. You can try switching the underlying container (maybe deque, or a specialized one). I'd almost certainly store pointers - again, profile to be sure.

    Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?

    Again, this may help a small amount, depending on the use case. You can avoid the heap allocation, but only if you don't need the array to outlive the stack frame... or you could reserve() the size in the vector so there is less copying on reallocation, as in the sketch below.
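
    A minimal sketch of the reserve() option (kUpperBound is an illustrative stand-in for your known small bound):

    #include <cstddef>
    #include <vector>

    constexpr std::size_t kUpperBound = 64;   // hypothetical known bound

    // One reserve() call up front removes all reallocation and element
    // copying during growth, capturing most of the benefit of a fixed
    // array while keeping vector's interface.
    std::vector<double> make_samples() {
        std::vector<double> samples;
        samples.reserve(kUpperBound);
        return samples;   // NRVO/move keeps this return cheap
    }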

    I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?

    You could look at the generated assembly to see whether RVO is applied, but if you return a pointer or reference, you can be sure there's no copy. Whether this helps depends on what you're doing - e.g., you can't return references to temporaries. You can use arenas to allocate and reuse objects, so as not to pay a large heap penalty.

    How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

    I've seen dramatic (seriously dramatic) speedups in this realm. I saw more improvement from loop reordering than I later saw from multithreading my code. Things may have changed in the five years since - there's only one way to be sure: profile.
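
    A sketch of the classic interchange on a row-major matrix (the names and N are illustrative):

    constexpr int N = 1024;

    // Column-outer traversal strides through memory N doubles at a time
    // and thrashes the cache on a row-major array.
    void scale_columnwise(double (&m)[N][N], double s) {
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                m[i][j] *= s;
    }

    // Row-outer traversal touches memory in unit stride; same arithmetic,
    // often several times faster on large matrices.
    void scale_rowwise(double (&m)[N][N], double s) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                m[i][j] *= s;
    }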

    On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?

    • Use explicit on your single-argument constructors. Temporary object construction and destruction may otherwise be hidden in your code (see the sketch after this list).

    • Be aware of hidden copy constructor calls on large objects. In some cases, consider replacing with pointers.

    • Profile, profile, profile. Tune areas that are bottlenecks.
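
    A minimal sketch of the explicit point (Matrix and consume are illustrative names):

    // Without 'explicit', the call consume(100) would compile by silently
    // constructing and destroying a temporary Matrix - a hidden cost.
    struct Matrix {
        explicit Matrix(int n) : n_(n) {}
        int n_;
    };

    void consume(const Matrix &) {}

    int main() {
        consume(Matrix(100));   // OK: construction is visible at the call site
        // consume(100);        // error thanks to 'explicit'
    }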
