Understanding perf detail when comparing two different implementations of a BFS algorithm

Submitted by 无人久伴 on 2019-12-02 06:57:40

Unfortunately, perf stat often doesn't give enough information to really determine where the bottleneck in your application is. It is possible to have two applications with wildly different underlying bottlenecks but with very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect while the other is almost unaffected, depending on the amount and nature of overlapping work.

So if you try to analyze in depth from these high-level counters, you are often just taking stabs in the dark. Still, we can make a few observations. You mention:

The optimized algorithm has a much higher last level cache miss which I thought would explain the primary issue with BFS / graph algorithms with low locality but performance seems to be unaffected by this and my unoptimized is significantly lower.

First, LLC misses are ~620 million for the optimized algorithm and ~380 million for your algorithm, but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses per run, and your algorithm has about six times that number of LLC misses. Yes, your algorithm has a lower LLC miss rate - but the absolute number of LLC misses is what counts for performance. The lower miss rate just means that the gap in total accesses is even larger than the 6x figure: basically you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.
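As a rough check of that arithmetic, here is a minimal Python sketch. The miss totals and run counts are the figures quoted above; the two miss rates are hypothetical placeholders (the real ones come from your perf stat output), included only to show why a lower miss rate combined with more misses implies far more total accesses.

```python
# Figures quoted in the answer above.
optimized_misses_total = 620e6    # LLC misses summed over all runs
optimized_runs = 10
unoptimized_misses_total = 380e6  # LLC misses for a single run
unoptimized_runs = 1

# Normalize to misses per run so the two algorithms are comparable.
optimized_per_run = optimized_misses_total / optimized_runs
unoptimized_per_run = unoptimized_misses_total / unoptimized_runs
miss_ratio = unoptimized_per_run / optimized_per_run

# HYPOTHETICAL miss rates, for illustration only; substitute the
# rates reported by perf stat for your own runs.
optimized_miss_rate = 0.30
unoptimized_miss_rate = 0.15

# total accesses = misses / miss rate, so a lower miss rate with more
# misses means the access count grows faster than the miss count.
optimized_accesses = optimized_per_run / optimized_miss_rate
unoptimized_accesses = unoptimized_per_run / unoptimized_miss_rate
access_ratio = unoptimized_accesses / optimized_accesses

print(f"optimized misses per run:   {optimized_per_run:,.0f}")
print(f"unoptimized misses per run: {unoptimized_per_run:,.0f}")
print(f"miss ratio (unopt/opt):     {miss_ratio:.2f}x")
print(f"access ratio (unopt/opt):   {access_ratio:.2f}x")
```

Whenever the unoptimized miss rate is the lower of the two, the access ratio comes out larger than the miss ratio, which is the point made above.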

All of this points to your unoptimized algorithm accessing more total memory, or perhaps accessing it in a much more cache-unfriendly fashion. That would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC). Given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory access problems, stalls while waiting for memory are very likely.

Luckily, there is a better way than just trying to reverse engineer what might be the bottleneck based on perf stat output. Intel has developed a whole methodology which tries to do this type of top-down analysis in a way that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology using Andi Kleen's toplev. I highly recommend you start there.
