Which memory access pattern is more efficient for a cached GPU?

耗尽温柔 submitted on 2019-12-05 10:58:23
Kurt

I think the answer depends on whether the accesses go to global or local memory. If you are pulling the data from global memory, you need to worry about coalescing the reads (i.e., neighboring work-items reading contiguous blocks, as in your second example). If you are pulling the data from local memory, you need to worry about bank conflicts instead. I have some experience with this, but not a lot, so I'm not stating this as absolute truth.
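To make the two concerns concrete, here is a rough Python model rather than real device code. It only counts how many memory segments a group of work-items touches (a stand-in for coalescing) and how many distinct words land in the same local-memory bank (a stand-in for bank conflicts). The group size of 32, the 64-byte segment, the 32 banks of 4 bytes each, and the function names are all assumptions chosen for illustration; real hardware differs in the details.

```python
GROUP = 32  # assumed number of work-items issuing one access together


def segments_touched(indices, elem_bytes=4, segment_bytes=64):
    """Count the distinct memory segments the group touches (fewer is better)."""
    return len({(i * elem_bytes) // segment_bytes for i in indices})


def bank_conflict_degree(indices, elem_bytes=4, num_banks=32, bank_bytes=4):
    """Largest number of distinct words that map to one bank (1 means no conflict)."""
    words_per_bank = {}
    for i in indices:
        word = (i * elem_bytes) // bank_bytes
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


# Work-item `lane` reads element `lane` (contiguous) or element `lane * 32` (strided).
contiguous = [lane for lane in range(GROUP)]
strided = [lane * 32 for lane in range(GROUP)]

print("global, contiguous:", segments_touched(contiguous), "segments")
print("global, strided:   ", segments_touched(strided), "segments")
print("local, contiguous: ", bank_conflict_degree(contiguous), "-way")
print("local, strided:    ", bank_conflict_degree(strided), "-way")
```

Under these assumptions, the contiguous pattern touches 2 segments and has no bank conflict, while the stride-32 pattern touches 32 separate segments and puts every access in the same bank.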

Edit: After reading up on GCN, I don't think the caches make a difference here. You can basically think of them as just speeding up global memory if you repeatedly read/write the same elements. On a side note, thanks for asking the question, because reading up on the new architecture is pretty interesting.

Edit 2: Here's a nice Stack Overflow discussion of banks for local and global memory: "Why aren't there bank conflicts in global memory for Cuda/OpenCL?"
