Effective optimization strategies on modern C++ compilers

梦如初夏 2020-12-22 17:02

I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to

19 answers
  • 2020-12-22 17:39

    Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?

    Generally, not unless you're working with a poor implementation. I wouldn't replace an STL container or algorithm just because you think you can write tighter code. I'd do it only if the STL version is more general than it needs to be for your problem. If you can write a simpler version that does just what you need, then there might be some speed to gain there.

    One exception I've seen is to replace a copy-on-write std::string with one that doesn't require thread synchronization.

    For std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?

    Unlikely. But if you're spending a lot of time on repeated allocations while growing up to a certain size, it might be profitable to add a reserve() call.
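    A minimal sketch of the reserve() suggestion follows. The function name make_squares is hypothetical and only serves as an illustration; the point is that a single up-front reservation replaces the repeated reallocations a growing vector would otherwise perform.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a vector of n squares.
// reserve() requests capacity for n elements once, so the push_back
// calls below never trigger a reallocation.
std::vector<double> make_squares(std::size_t n) {
    std::vector<double> out;
    out.reserve(n);  // capacity >= n, size is still 0
    for (std::size_t i = 0; i < n; ++i) {
        const double x = static_cast<double>(i);
        out.push_back(x * x);
    }
    return out;
}
```

    Whether this helps measurably depends on how often the vector grows, so it is worth timing before and after.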

    performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference.

    When working with containers, I pass iterators for the inputs and an output iterator, which is still pretty general.
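    As an illustration of the iterator-based style described above, here is a small sketch. The function scale_range and its factor parameter are made up for this example; the caller decides where the output lives, so the function itself never allocates or returns a container.

```cpp
#include <iterator>
#include <vector>

// Hypothetical example: multiplies every element of [first, last) by
// factor and writes the results through the output iterator.
template <typename InputIt, typename OutputIt>
OutputIt scale_range(InputIt first, InputIt last, OutputIt out, double factor) {
    for (; first != last; ++first) {
        *out++ = *first * factor;
    }
    return out;  // one past the last element written
}
```

    A caller might pass std::back_inserter(result) to append to a vector, or a raw pointer into a preallocated buffer.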

    How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?

    Not very. Yes. I find that missed branch predictions and cache-hostile memory access patterns are the two biggest killers of performance (once you've gotten to reasonable algorithms). A lot of older code uses "early out" tests to reduce calculations. But on modern processors, that's often more expensive than doing the math and ignoring the result.
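    To make the loop-reordering point concrete, here is a sketch that assumes a matrix stored row-major in a flat std::vector (that layout and both function names are assumptions for illustration). Both functions compute the same sum; only the traversal order differs.

```cpp
#include <cstddef>
#include <vector>

// Element (r, c) of a row-major matrix lives at index r * cols + c.

// Inner loop walks consecutive addresses: cache-friendly.
double sum_rows_inner_cols(const std::vector<double>& m,
                           std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same arithmetic with the loops swapped: the inner loop now jumps
// cols elements at a time, which tends to miss the cache on large inputs.
double sum_cols_inner_rows(const std::vector<double>& m,
                           std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

    On small matrices that fit in cache the two should perform about the same; any difference is expected to show up only once the working set exceeds the cache, so measure with realistic sizes.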

    A significant bottleneck in my code used to be conversions from floating point to integers

    Yup. I recently discovered the same issue.

    One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops.

    Some compilers can deal with this. Visual C++ has a "link-time code generation" option that effectively re-invokes the compiler to do further optimization. And, in the case of functions like strlen, many compilers will recognize that as an intrinsic function.
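    If you would rather not rely on the compiler recognizing strlen, the hoisting can be done by hand. A minimal sketch (count_spaces is a hypothetical function used only to show the pattern):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical example: counts spaces in a C string.
// strlen() is evaluated once before the loop instead of on every
// evaluation of the loop condition.
std::size_t count_spaces(const char* s) {
    const std::size_t len = std::strlen(s);
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (s[i] == ' ') {
            ++n;
        }
    }
    return n;
}
```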

    Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?

    When you're optimizing at this low level, there are few reliable rules of thumb. Compilers will vary. Measure your current solution, and decide if it's too slow. If it is, come up with a hypothesis (e.g., "What if I replace the inner if-statements with a look-up table?"). It might help ("eliminates stalls due to failed branch predictions") or it might hurt ("look-up access pattern hurts cache locality"). Experiment and measure incrementally.
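    The look-up table hypothesis could be tried with something like the sketch below. The category values and weights are invented for illustration; the two functions are meant to return identical results so that only their timing differs.

```cpp
#include <array>

// Hypothetical branchy version: a chain of if-statements.
double weight_branchy(unsigned kind) {
    if (kind == 0) return 1.0;
    if (kind == 1) return 0.5;
    if (kind == 2) return 0.25;
    return 0.0;
}

// Table version: one clamped index instead of a chain of comparisons.
// Every kind of 3 or more maps to the last slot.
double weight_table(unsigned kind) {
    static constexpr std::array<double, 4> table{1.0, 0.5, 0.25, 0.0};
    return table[kind < 3 ? kind : 3];
}
```

    Which one wins depends on how predictable the branches are and whether the table stays in cache, which is why it needs to be measured rather than assumed.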

    I'll often clone the straightforward implementation and use an #ifdef HAND_OPTIMIZED/#else/#endif to switch between the reference version and the tweaked version. It's useful for later code maintenance and validation. I commit each successful experiment to change control, and keep a log (spreadsheet) with the changelist number, run times, and explanation for each step in optimization. As I learn more about how the code behaves, the log makes it easy to back up and branch off in another direction.

    You need a framework for running reproducible timing tests and to compare results to the reference version to make sure you don't inadvertently introduce bugs.
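    A minimal sketch of that workflow is shown below: a reference implementation, a tweaked one, an #ifdef HAND_OPTIMIZED switch, and a tiny timing helper. The function names and the summation example are assumptions made for illustration, not a prescribed framework.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Straightforward reference version: kept for validation.
double sum_reference(const std::vector<double>& v) {
    double total = 0.0;
    for (double x : v) total += x;
    return total;
}

// Tweaked version: four independent accumulators.
// Note that this changes the floating-point summation order, so results
// may differ from the reference in the last bits on general input.
double sum_tweaked(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    double total = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i) total += v[i];  // leftover elements
    return total;
}

// Compile with -DHAND_OPTIMIZED to select the tweaked version.
double sum(const std::vector<double>& v) {
#ifdef HAND_OPTIMIZED
    return sum_tweaked(v);
#else
    return sum_reference(v);
#endif
}

// Runs f(v) once, stores its result, and returns the elapsed seconds.
template <typename F>
double time_seconds(F f, const std::vector<double>& v, double& result) {
    const auto start = std::chrono::steady_clock::now();
    result = f(v);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

    In practice you would run each version several times, record the best or median time in the log, and compare the results against the reference with a tolerance suited to your problem.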

  • 2020-12-22 17:39

    If I were working on this, I would expect an end-stage where things like cache locality and vector operations would come into play.

    However, before getting to the end stage, I would expect to find a series of problems of different sizes having less to do with compiler-level optimization, and more to do with odd stuff going on that could never be guessed, but once found, are simple to fix. Usually they revolve around class overdesign and data structure issues.

    Here's an example of this kind of process.

    I have found that generalized container classes with iterators, which in principle the compiler can optimize down to minimal cycles, often are not so optimized for some obscure reason. I've also heard other cases on SO where this happens.

    Others have said, before you do anything else, profile. I agree with that approach except I think there's a better way, and it's indicated in that link. Whenever I find myself asking if some specific thing, like STL, could be a problem, I just might be right - BUT - I'm guessing. The fundamental winning idea in performance tuning is find out, don't guess. It is easy to find out for sure what is taking the time, so don't guess.

  • 2020-12-22 17:41

    Is there any benefit to replacing STL containers/algorithms with hand-rolled ones?
    I would only consider this as a last option. The STL containers and algorithms have been thoroughly tested. Creating new ones is expensive in terms of development time.

    Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
    First, try reserving space for the vectors. Check out the std::vector::reserve method. A vector that keeps growing or changing to larger sizes is going to waste dynamic memory and execution time. Add some code to determine a good value for an upper bound.

    I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)?
    As a matter of principle, always pass large structures by reference or pointer. Prefer passing by constant reference. If you are using pointers, consider using smart pointers.
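    The options being compared can be sketched as follows. All three function names are hypothetical. Since C++11 a returned local vector is at worst moved rather than deep-copied, and compilers commonly elide even the move, so returning by value is often cheaper than it looks; the out-parameter form mainly helps when the caller reuses the same buffer across many calls.

```cpp
#include <cstddef>
#include <vector>

// Input passed by const reference: no copy of the caller's data.
double mean(const std::vector<double>& samples) {
    if (samples.empty()) return 0.0;
    double total = 0.0;
    for (double x : samples) total += x;
    return total / static_cast<double>(samples.size());
}

// Result returned by value: the local is moved or elided, not deep-copied.
std::vector<double> make_ramp(std::size_t n) {
    std::vector<double> ramp(n);
    for (std::size_t i = 0; i < n; ++i) ramp[i] = static_cast<double>(i);
    return ramp;
}

// Result written into a caller-owned buffer: repeated calls can reuse
// the capacity the vector already has.
void fill_ramp(std::vector<double>& out, std::size_t n) {
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = static_cast<double>(i);
}
```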

    How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
    Modern compilers are very aware of instruction caches (pipelines) and try to keep them from being reloaded. You can always assist your compiler by writing code that uses fewer branches (from if, switch, loop constructs and function calls).

    You may see a more significant performance gain by adjusting your program to optimize the data cache. Search the web for Data-Oriented Design. There are many excellent articles on this topic.

    Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of?
    For accuracy, keep everything as a double. Adjust for rounding only when necessary and perhaps before displaying. This falls under the optimization rule: use less code, and eliminate extraneous or deadwood code.

    Also see the section above about reserving space in containers before using them.

    Some processors can load and store floating point numbers either faster or as fast as integers. This would require gathering profile data before optimizing. However, if you know there is a minimal resolution, you could use integers and change your base to that minimal resolution. For example, when dealing with U.S. money, integers can be used to represent 1/100 or 1/1000 of a dollar.
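    A small sketch of the integer-resolution idea, using whole cents as the base unit. The Cents type and helper names are hypothetical and assume non-negative amounts for the split into dollars and cents.

```cpp
#include <cstdint>

// Hypothetical fixed-resolution money type: the value is a count of
// 1/100 dollar units, so additions are exact integer operations.
struct Cents {
    std::int64_t value;
};

Cents add(Cents a, Cents b) {
    return Cents{a.value + b.value};
}

// Whole-dollar part of a non-negative amount.
std::int64_t dollars_part(Cents c) {
    return c.value / 100;
}

// Remaining cents of a non-negative amount.
std::int64_t cents_part(Cents c) {
    return c.value % 100;
}
```

    Adding 10 cents and 20 cents this way yields exactly 30, whereas 0.10 + 0.20 in binary floating point is not exactly 0.30.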

    One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as moving method calls like strlen() out of the termination conditions of loops. Are there any optimizations like this one that I should look out for because they can't be done by the compiler and must be done by hand?
    This is an incorrect assumption. Compilers can optimize based on the function's signature, especially if the parameters correctly use const. I always like to assist the compiler by moving constant stuff outside of the loop. For an upper limit value, such as a string length, assign it to a const variable before the loop. The const modifier will assist the optimizer.

    There is always the count-down optimization in loops. For many processors, a jump on register equals zero is more efficient than compare and jump if less than.
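    The count-down form looks like this in practice. The function name is made up for illustration; note that many optimizing compilers already perform this transformation on their own, and that reversing the order of a floating-point sum can change the result in the last bits.

```cpp
#include <cstddef>

// Hypothetical example: sums data[n-1] down to data[0].
// The loop condition is a comparison against zero, and the unsigned
// index never wraps because the test happens before the decrement is used.
double sum_count_down(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = n; i-- > 0;) {
        total += data[i];
    }
    return total;
}
```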

    On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code?
    I would avoid "micro optimizations". If you have any doubts, print out the assembly code generated by the compiler (for the area you are questioning) under the highest optimization setting. Try rewriting the code to express the compiler's assembly code. Optimize this code, if you can. Anything more requires platform specific instructions.

    Optimization Ideas & Concepts

    1. Computers prefer to execute sequential instructions.
    Branching upsets them. Some modern processors have enough instruction cache to contain code for small loops. When in doubt, don't cause branches.

    2. Eliminate Requirements
    Less code, more performance.

    3. Optimize designs before code
    Often, more performance can be gained by changing the design than by changing the implementation of the design. Less design promotes less code, which generates more performance.

    4. Consider data organization
    Optimize the data.
    Organize frequently used fields into substructures. Set data sizes to fit into a data cache line. Remove constant data out of data structures.
    Use const specifier as much as possible.

    5. Consider page swapping
    Operating systems will swap out your program or task for another one, often into a 'swap file' on the hard drive. Breaking up the code into chunks that contain heavily executed code and less executed code will assist the OS. Also, consolidate heavily used code into tighter units. The idea is to reduce the swapping of code from the hard drive (such as fetching "far" functions). If code must be swapped out, it should be as one unit.

    6. Consider I/O optimizations (Includes file I/O too).
    Most I/O prefers fewer large chunks of data to many small chunks of data. Hard drives like to keep spinning. Larger data packets have less overhead than smaller packets.
    Format data into a buffer then write the buffer.

    7. Eliminate the competition
    Get rid of any programs and tasks that are competing against your application for the processor(s), such as virus scanning and playing music. Even I/O drivers want a piece of the action (which is why you want to reduce the number of I/O transactions).

    These should keep you busy for a while. :-)

  • 2020-12-22 17:41

    The STL priority queue implementation is fairly well-optimized for what it does, but certain kinds of heaps have special properties that can improve your performance on certain algorithms. Fibonacci heaps are one example. Also, if you're storing objects with a small key and a large amount of satellite data, you'll get a major improvement in cache performance if you store that data separately, even if it means storing one extra pointer per object.
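    The key/satellite split can be sketched as below. The Payload and Entry types are invented for illustration: the large data lives in its own array, while the array that gets sorted or searched holds only a small key plus one pointer per object.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical large satellite data, stored separately.
struct Payload {
    std::string name;
    double samples[64];
};

// Compact entry: many of these fit in one cache line.
struct Entry {
    std::uint32_t key;
    const Payload* payload;  // the one extra pointer per object
};

// Sorting touches only the compact entries; the payloads never move.
void sort_by_key(std::vector<Entry>& entries) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.key < b.key; });
}
```

    The payload storage must outlive the entries and must not be reallocated while the pointers are in use.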

    As for arrays, I've found std::vector to even slightly out-perform compile-time-constant arrays. That said, its optimizations are general, and specific knowledge of your algorithm's access patterns may allow you to optimize further for cache locality, alignment, coloring, etc. If you find that your performance drops significantly past a certain threshold due to cache effects, hand-optimized arrays may move that problem size threshold by as much as a factor of two in some cases, but it's unlikely to make a huge difference for small inner loops that fit easily within the cache, or large working sets that exceed the size of any CPU cache. Work on the priority queue first.

    Most of the overhead of dynamic memory allocation is constant with respect to the size of the object being allocated. Allocating one large object and returning it by a pointer isn't going to hurt as much as copying it. The threshold for copying vs. dynamic allocation varies greatly between systems, but it should be fairly consistent within a chip generation.

    Compilers are quite cache-aware when cpu-specific tuning is turned on, but they don't know the size of the cache. If you're optimizing for cache size, you may want to detect that or have the user specify it at run-time, since that will vary even between processors of the same generation.

    As for floating point, you absolutely should be using SSE. This doesn't necessarily require learning SSE yourself, as there are many libraries of highly-optimized SSE code that do all sorts of important scientific computing operations. If you're compiling 64-bit code, the compiler might emit some SSE code automatically, as SSE2 is part of the x86_64 instruction set. SSE will also save you some of the overhead of x87 floating point, since it's not converting back and forth to 80-bit values internally. Those conversions can also be a source of accuracy problems, since you can get different results from the same set of operations depending on how they get compiled, so it's good to be rid of them.

  • 2020-12-22 17:43

    And I think the main hint anyone could give you is: measure, measure, measure. That and improving your algorithms.
    The way you use certain language features, the compiler version, std lib implementation, platform, machine: all play their role in performance. You haven't mentioned many of those, and none of us has ever had your exact setup.

    Regarding replacing std::vector: use a drop-in replacement (e.g., this one) and just try it out.

  • 2020-12-22 17:46

    About STL containers.

    Most people here claim STL offers one of the fastest possible implementations of the container algorithms. And I say the opposite: for most real-world scenarios the STL containers taken as-is yield really catastrophic performance.

    People argue about the complexity of the algorithms used in STL. Here STL is good: O(1) for list/queue, vector (amortized), and O(log(N)) for map. But this is not the real bottleneck of the performance for a typical application! For many applications the real bottleneck is the heap operations (malloc/free, new/delete, etc.).

    A typical operation on the list costs just a few CPU cycles. On a map, some tens, maybe more (this depends on the cache state and log(N) of course). And typical heap operations cost from hundreds to thousands (!!!) of CPU cycles. For multithreaded applications, for instance, they also require synchronization (interlocked operations). Plus on some OSs (such as Windows XP) the heap functions are implemented entirely in kernel mode.

    So the actual performance of the STL containers in a typical scenario is dominated by the number of heap operations they perform. And here they're disastrous. Not because they're implemented poorly, but because of their design. That is, this is a question of design.

    On the other hand, there are other containers which are designed differently. I once designed and wrote such containers for my own needs:

    http://www.codeproject.com/KB/recipes/Containers.aspx

    And they proved to be superior for me from the performance point of view, and not only in that respect.

    But recently I've discovered I'm not the only one who thought about this. boost::intrusive is the container library that is implemented in the manner similar to what I did then.
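    To show what "intrusive" means here, below is a hand-written sketch of the idea. It is not boost::intrusive's API and not the code from the linked article; the Node and IntrusiveStack names are hypothetical. The link pointer lives inside the element itself, so inserting into the container performs no heap operation at all.

```cpp
#include <cstddef>

// Hypothetical element type: the hook (next) is part of the object.
struct Node {
    explicit Node(int v) : value(v), next(nullptr) {}
    int value;
    Node* next;
};

// Minimal intrusive stack: it links caller-owned nodes and never
// allocates or frees memory itself.
class IntrusiveStack {
public:
    void push(Node& n) {
        n.next = head_;
        head_ = &n;
        ++size_;
    }

    // Returns the most recently pushed node, or nullptr when empty.
    Node* pop() {
        Node* n = head_;
        if (n != nullptr) {
            head_ = n->next;
            n->next = nullptr;
            --size_;
        }
        return n;
    }

    std::size_t size() const { return size_; }

private:
    Node* head_ = nullptr;
    std::size_t size_ = 0;
};
```

    The trade-off is that the caller owns the nodes and must keep them alive while they are linked, and a node with a single hook can be in only one such container at a time.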

    I suggest you try it (if you didn't already)
