I\'m working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it\'s time to
If you work on big matrices for instance, consider tiling your loops to improve the locality. This often leads to dramatic improvements. You can use VTune/PTU to monitor the L2 cache misses.