CUDA zero-copy performance


Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:

  1. The memory operation used a size specifier (vector type) that requires multiple transactions in order to perform the address divergence calculation and to move data to/from the L1 cache.
  2. The memory operation had thread address divergence, requiring access to multiple cache lines (causes 1 and 2 are illustrated in the sketch after this list).
  3. The memory transaction missed the L1 cache. When the missed data is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
  4. The load/store unit (LSU) resources are full and the instruction needs to be replayed when the resources become available.
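
As a rough sketch of causes 1 and 2 (the kernel and buffer names here are invented for illustration, not taken from the original question), a 128-bit vector load and a strided 32-bit load might look like this:

```
// Sketch only: hypothetical kernel illustrating replay causes 1 and 2.
// A float4 (128-bit) load needs multiple transactions per warp, and a
// strided 32-bit load makes the threads of a warp touch many cache lines.
__global__ void replay_examples(const float4 *vec_in, const float *strided_in,
                                float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Cause 1: vector-type (128-bit) load -> several L1 transactions per warp.
    float4 v = vec_in[tid];

    // Cause 2: address divergence -> each thread can land in a different
    // cache line when stride is large.
    float s = strided_in[tid * stride];

    out[tid] = v.x + v.y + v.z + v.w + s;
}
```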

The latency to

  • L2 is 200-400 cycles
  • device memory (DRAM) is 400-800 cycles
  • zero-copy memory over PCIe is thousands of cycles

The replay overhead increases because the longer latency leads to more misses and more contention for LSU resources.
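
For context, zero-copy memory is pinned host memory mapped into the device address space, so every access from a kernel crosses PCIe. A minimal allocation sketch, assuming a device that supports mapped host memory (the kernel and variable names are made up; error checking is omitted):

```
// Sketch only: allocate mapped (zero-copy) host memory and read it from
// a kernel. Every load of src in the kernel travels over PCIe.
#include <cuda_runtime.h>

__global__ void read_zero_copy(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i] * 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_src, *d_src, *d_dst;

    cudaSetDeviceFlags(cudaDeviceMapHost);                 // allow mapped host memory
    cudaHostAlloc((void **)&h_src, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_src, h_src, 0);   // device view of h_src
    cudaMalloc((void **)&d_dst, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_src[i] = (float)i;

    read_zero_copy<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
    cudaDeviceSynchronize();

    cudaFree(d_dst);
    cudaFreeHost(h_src);
    return 0;
}
```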

The global load efficiency is not increasing because it is the ratio of the ideal amount of data that would need to be transferred for the executed memory instructions to the amount of data actually transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache-line boundary (a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
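
As a back-of-the-envelope illustration of that ratio (kernel names invented for this sketch): with 32-bit loads, a fully coalesced warp requests 32 x 4 = 128 bytes, which fits in one 128-byte cache line, so efficiency is ~100%; a stride-32 pattern touches 32 different cache lines, moving 4096 bytes for the same 128 requested bytes, roughly 3%:

```
// Sketch only: contrasting ideal and non-ideal global load patterns for
// the load-efficiency metric described above. Buffers are assumed large
// enough for the strided accesses.
__global__ void coalesced_load(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each warp requests 32 consecutive 32-bit words = 128 bytes
    // = 1 cache line: ideal, ~100% global load efficiency.
    out[i] = in[i];
}

__global__ void strided_load(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread in a warp lands in a different 128-byte cache line:
    // 32 lines (4096 bytes) transferred for 128 requested bytes, ~3%.
    out[i] = in[i * 32];
}
```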

The profiler exposes the following counters:

  • gld_throughput
  • l1_cache_global_hit_rate
  • dram_{read, write}_throughput
  • l2_l1_read_hit_rate

In the zero-copy case all of these metrics should be much lower.

The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero-copy memory).
