Calculating achieved bandwidth and flops/Gflops, and evaluating CUDA kernel performance

Posted by 二次信任 on 2019-12-03 22:45:08

Nsight Visual Studio Edition 2.1 and Above

The information you requested is available if you collect the Achieved FLOPS experiment and the Memory Statistics - Buffers experiment.

Visual Profiler 4.2 and Above

Achieved Bandwidth: When you mouse over a kernel in the Timeline, this information is available in the Properties Pane under Memory\DRAM Utilization.

The profiler cannot collect a FLOP count yet. You can count the operations by hand by running cuobjdump -sass to view the assembly code. Step through the kernel and count the single- and double-precision floating-point instructions, multiplying FMA and DFMA operations by 2. Each instruction count should also be multiplied by the number of threads for which the predicate is true. You also have to account for control flow. This is not fun and requires someone with a strong knowledge of the instruction set. It may be easier to single-step the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties Pane and Details Pane as Duration.
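The arithmetic behind that manual count can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every thread is assumed to execute the same instructions with no predicated-off lanes, the opcode table lists a few SASS floating-point opcodes as examples (FFMA is the single-precision fused multiply-add), and the instruction counts, thread count and duration are made-up numbers, not measurements.

```python
# FLOPs contributed by one executed instruction; fused multiply-adds count as 2.
# The opcode list is illustrative and not a complete SASS reference.
FLOPS_PER_INSTRUCTION = {
    "FADD": 1, "FMUL": 1, "FFMA": 2,   # single precision
    "DADD": 1, "DMUL": 1, "DFMA": 2,   # double precision
}

def count_flops(executed_per_thread, active_threads):
    """Total FLOPs, assuming every active thread executes the same counts."""
    per_thread = sum(FLOPS_PER_INSTRUCTION[op] * n
                     for op, n in executed_per_thread.items())
    return per_thread * active_threads

def gflops(total_flops, duration_ms):
    """Achieved GFLOPS from a FLOP count and the profiler's Duration in ms."""
    return total_flops / (duration_ms / 1e3) / 1e9

# Hypothetical kernel: counts as if read off a cuobjdump -sass listing.
executed = {"FFMA": 128, "FADD": 4}
total = count_flops(executed, active_threads=1 << 20)
rate = gflops(total, duration_ms=0.5)
print(total, "FLOPs,", round(rate, 2), "GFLOPS")
```

For a real kernel with divergent branches or loops whose trip count depends on the data, the per-thread counts differ, so the counts would have to be gathered per path (or per warp) before summing.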

You could follow the calculations of Mark Harris in Optimizing Parallel Reductions in CUDA. There he takes the size of the input data as the basis and divides it by the kernel execution time. In the examples he used 2^22 ints, so he has 0.016777216 GB of input data. The first kernel took 8.054 ms, which gives an achieved bandwidth of 2.083 GB/s.

After several optimizations he reached 62.671 GB/s and compared it to the peak bandwidth of the GPU he used, which is 86.4 GB/s.
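The figures quoted above can be reproduced with a short calculation. The sketch below assumes 4-byte ints and 1 GB = 10^9 bytes, and, like the slides, counts only the input data that is read; the function name is just a placeholder.

```python
def achieved_bandwidth_gbs(num_elements, bytes_per_element, kernel_time_ms):
    """Input bytes divided by kernel time, in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = num_elements * bytes_per_element
    return total_bytes / 1e9 / (kernel_time_ms / 1e3)

# First reduction kernel: 2^22 ints (assumed 4 bytes each) in 8.054 ms.
first_kernel = achieved_bandwidth_gbs(2**22, 4, 8.054)

# Best reported result relative to the GPU's peak bandwidth.
fraction_of_peak = 62.671 / 86.4

print(f"{first_kernel:.3f} GB/s")            # 2.083 GB/s
print(f"{fraction_of_peak:.1%} of peak")     # 72.5% of peak
```

If a kernel also writes a significant amount of data, the bytes written would be added to the numerator before dividing.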

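Although he used ints, you can easily adapt that to flops/Gflops: divide the number of floating-point operations the kernel performs, rather than the number of bytes, by the kernel time.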
