In CUDA profiler nvvp, what does the “Shared/Global Memory Replay Overhead” mean? How is it computed?

Submitted by 不打扰是莪最后的温柔 on 2019-12-10 13:26:24

Question


When we use the CUDA profiler nvvp, there are several "overhead" metrics associated with instructions, for example:

  • Branch Divergence Overhead;
  • Shared/Global Memory Replay Overhead; and
  • Local/Global Cache Replay Overhead.

My questions are:

  1. What causes these overheads?
  2. How are they computed?
  3. Similarly, how are the Global Load/Store Efficiency metrics computed?

Update: I've found all the formulas for computing these overheads in the 'CUDA Profiler Users Guide' shipped with the CUDA 5 toolkit.


Answer 1:


You can find some of the answers to your questions here:

Why does CUDA Profiler indicate replayed instructions: 82% != global replay + local replay + shared replay?

Replayed Instructions (%) This gives the percentage of instructions replayed during kernel execution. Replayed instructions are the difference between the number of instructions actually issued by the hardware and the number of instructions to be executed by the kernel. Ideally this should be zero. This is calculated as 100 * (instructions issued - instructions executed) / instructions issued

Global memory replay (%) Percentage of replayed instructions caused by global memory accesses. This is calculated as 100 * (l1 global load miss) / instructions issued

Local memory replay (%) Percentage of replayed instructions caused by local memory accesses. This is calculated as 100 * (l1 local load miss + l1 local store miss) / instructions issued

Shared bank conflict replay (%) Percentage of replayed instructions caused by shared memory bank conflicts. This is calculated as 100 * (l1 shared conflict) / instructions issued
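
Putting the four formulas together: below is a minimal host-side sketch in C that evaluates them. All counter values are made up for illustration; in practice they come from the profiler's raw counters (instructions issued/executed, l1 global load miss, l1 local load/store miss, l1 shared conflict).

    /* Minimal sketch: evaluate the replay formulas above.
       All counter values are hypothetical stand-ins for numbers
       reported by nvvp / the CUDA command-line profiler. */
    #include <stdio.h>

    int main(void)
    {
        double inst_issued        = 1200000.0;
        double inst_executed      = 1000000.0;
        double l1_global_ld_miss  =   60000.0;
        double l1_local_ld_miss   =   10000.0;
        double l1_local_st_miss   =    5000.0;
        double l1_shared_conflict =   30000.0;

        printf("Replayed instructions:       %.2f%%\n",
               100.0 * (inst_issued - inst_executed) / inst_issued);
        printf("Global memory replay:        %.2f%%\n",
               100.0 * l1_global_ld_miss / inst_issued);
        printf("Local memory replay:         %.2f%%\n",
               100.0 * (l1_local_ld_miss + l1_local_st_miss) / inst_issued);
        printf("Shared bank conflict replay: %.2f%%\n",
               100.0 * l1_shared_conflict / inst_issued);
        return 0;
    }

With these example numbers the four metrics come out to 16.67%, 5.00%, 1.25% and 2.50% respectively.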

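For the last metric, a shared memory replay happens when threads of a warp access different words that map to the same bank, forcing the hardware to reissue the access. Here is an illustrative CUDA kernel pair (the kernel names and launch configuration are hypothetical; it assumes the 32 four-byte banks of the Fermi-era hardware these nvvp counters were introduced for):

    // Provokes 32-way bank conflicts: with a stride of 32 floats,
    // every thread of the warp hits bank 0, so each shared memory
    // access is replayed ~31 times. Launch as <<<1, 32>>>.
    __global__ void bankConflictDemo(float *out)
    {
        __shared__ float tile[32 * 32];
        int tid = threadIdx.x;
        tile[tid * 32] = (float)tid;   // all 32 writes land in bank 0
        __syncthreads();
        out[tid] = tile[tid * 32];     // all 32 reads land in bank 0
    }

    // Conflict-free variant: padding each row by one element makes
    // consecutive threads fall into distinct banks.
    __global__ void noConflictDemo(float *out)
    {
        __shared__ float tile[32 * 33];
        int tid = threadIdx.x;
        tile[tid * 33] = (float)tid;   // tid * 33 mod 32 == tid
        __syncthreads();
        out[tid] = tile[tid * 33];
    }

Profiling the first kernel should show a clearly non-zero "Shared bank conflict replay (%)", while the padded variant should bring it back to (near) zero.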


Source: https://stackoverflow.com/questions/13551923/in-cuda-profiler-nvvp-what-does-the-shared-global-memory-replay-overhead-mean
