Using CUDA Profiler nvprof for memory accesses

感情迁移 posted on 2019-12-02 13:20:53

It's customary to ask a question, something more specific than "Can someone help me?" Your code as shown has no floating-point operations (+, *, etc.), so there is no CGMA (compute to global memory access ratio) to compute; it is zero.

Regarding the memory transactions, your code has 4 threadblocks:

 dim3 dimGrid(4,1,1);

Each threadblock may run on a separate multiprocessor. You have 10 threads in each block. The following line of code:

            d_Out[n]=d_In[n]+1;

will generate at least one global load transaction (d_In) and one global store transaction (d_Out) per block to service the threads. In the fourth block, the active threads have global indices (n) of 30-35. When this block executes the above line of code, it will generate two global load and two global store transactions, because those indices straddle a cacheline boundary (presumably at element 32, i.e. 128-byte cachelines and 4-byte elements), so the threads require two cachelines to service their requests. The first three blocks contribute one transaction of each kind and the fourth contributes two, so this one line of code may generate 5 global load transactions and 5 global store transactions.
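The counting argument above can be checked with a small host-side sketch. It is only an illustration under assumptions the original post does not state: 4-byte elements, 128-byte cachelines, arrays aligned to a cacheline boundary, and 36 valid elements (inferred from the active indices 30-35). The names `linesTouched` and `transactionsPerArray` are hypothetical helpers, not part of any CUDA API, and real hardware may count transactions differently depending on architecture and caching mode.

```cpp
#include <cstddef>

// Assumed values, not taken from the original post:
// 128-byte cachelines and 4-byte array elements.
constexpr std::size_t kLineBytes = 128;
constexpr std::size_t kElemBytes = 4;

// Number of distinct cachelines covered by elements [first, last],
// assuming the array starts on a cacheline boundary.
constexpr std::size_t linesTouched(std::size_t first, std::size_t last) {
    return (last * kElemBytes) / kLineBytes
         - (first * kElemBytes) / kLineBytes + 1;
}

// Sum, over all blocks, of the cachelines touched by each block's active
// threads. With 10 threads per block, each block holds a single warp, so
// one block corresponds to one memory request per array access.
constexpr std::size_t transactionsPerArray(std::size_t numBlocks,
                                           std::size_t threadsPerBlock,
                                           std::size_t numElements) {
    std::size_t total = 0;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        std::size_t first = b * threadsPerBlock;
        if (first >= numElements) continue;  // block has no active threads
        std::size_t last = first + threadsPerBlock - 1;
        if (last >= numElements) last = numElements - 1;  // clamp to valid range
        total += linesTouched(first, last);
    }
    return total;
}
```

With these assumptions, `transactionsPerArray(4, 10, 36)` evaluates to 5: one cacheline for each of the first three blocks and two for the fourth, whose indices 30-35 cross the boundary at element 32. The same count would apply to each array that the kernel reads or writes with this indexing.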

For similar reasons, the next line of code:

            d_rows[n]=crows;

may generate 5 additional global store transactions. So of your profiler output:

  1                gld_transactions        Global Load Transactions           6           6           6
  1                gst_transactions       Global Store Transactions          11 

I believe I have explained 5 of the 6 global load transactions, and 10 of the 11 global store transactions. Hopefully that is enough to give you an idea of the origin of these numbers.
