Calculating performance of CUFFT

*爱你&永不变心* 提交于 2019-11-28 12:58:41

If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is

operation count / wall clock time

In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (ie. how long the whole operation took) for the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)

As it stands, your calculation represents the serial execution time. Allowing for the overheads from the multigpu scheme, I would expect that the calculated performance numbers you are getting will be lower than the equivalent transform done on a single GPU.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!