How to use coalesced memory access
I have 'N' threads to perform simultaneously on device which they need M*N float from the global memory. What is the correct way to access the global memory coalesced? In this matter, how the shared memory can help? Usually, a good coalesced access can be achieved when the neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing: arr[tid] --- gives perfect coalescence arr[tid+5] --- is almost perfect, probably misaligned arr[tid*4] --- is not that good anymore, because of the gaps arr[random(0..N)] --- horrible! I am talking from the