I know that I can use gprof to benchmark my code.
However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a prox
My advice would be to use PTU (Performance Tuning Utility) from Intel.
This utility is the direct descendant of VTune and provide the best available sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), and this with no slowdown of your application or perturbation of the profile. And of course you'll be able to gather all cache line misses events you are looking for.