How can I reduce the number of cache misses when designing a C++ program?
Does inlining functions help every time, or is it good only when the program is CPU
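Let the CPU prefetch data efficiently. For example, you can reduce the number of cache misses when processing multi-dimensional arrays by traversing them row by row rather than column by column (C++ stores built-in arrays in row-major order, so the elements of a row are adjacent in memory), by unrolling loops, and so on.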
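To make the row-versus-column point concrete, here is a minimal sketch. It assumes the matrix is stored contiguously in row-major order in a `std::vector<int>`; the function names `sum_by_rows` and `sum_by_columns` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums a rows x cols matrix stored in row-major order.
// The inner loop walks consecutive addresses, so each cache line that is
// loaded gets fully used and the hardware prefetcher sees a simple pattern.
long long sum_by_rows(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but the inner loop jumps `cols` elements between accesses.
// For a large matrix, consecutive accesses usually land on different cache lines.
long long sum_by_columns(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Both functions compute the same sum and differ only in the order of memory accesses. How much slower the column-wise version is (if at all) depends on the matrix size, the cache sizes and the compiler, so treat this as something to measure rather than a guaranteed win.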
This kind of optimization depends on the hardware architecture, so you are better off using a platform-specific profiler such as Intel VTune to detect possible cache problems.