What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix exampl
Memory coalescing is a technique that allows optimal use of the global memory bandwidth. That is, the most favorable access pattern is achieved when parallel threads running the same instruction access consecutive locations in global memory.

The example in the figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v_j^i. Each thread in the GPU kernel is assigned to one vector of length m. Threads in CUDA are grouped into an array of blocks, and every thread has a unique id, which can be defined as indx = bd*bx + tx, where bd is the block dimension, bx is the block index and tx is the thread index within its block.
The vertical arrows show the case in which parallel threads access the first component of each vector, i.e. addresses 0, m, 2m, ... of the memory. As shown in Fig. (a), the memory access in this case is not consecutive. By zeroing the gap between these addresses (red arrows in the figure above), the memory access becomes coalesced.
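A rough way to see why the strided pattern is costly is to count how many aligned memory segments a single warp touches. The sketch below is plain host-side C++, not a measurement. It assumes 32 threads per warp, 128-byte segments and 4-byte elements, and `segments_touched` is a hypothetical helper name; the real transaction granularity depends on the GPU architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts how many aligned memory segments one warp touches when thread t
// reads the element at index t * stride.
// Assumptions (not from the original text): 32 threads per warp,
// 128-byte segments, 4-byte elements, array aligned to a segment boundary.
std::size_t segments_touched(std::size_t stride,
                             std::size_t warp_size = 32,
                             std::size_t segment_bytes = 128,
                             std::size_t elem_bytes = 4) {
    assert(segment_bytes > 0 && elem_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < warp_size; ++t) {
        std::size_t byte_address = t * stride * elem_bytes;
        segments.insert(byte_address / segment_bytes);  // segment holding this address
    }
    return segments.size();
}
```

Under these assumptions, a stride of 1 (consecutive addresses) touches a single segment, while a stride of m = 32 touches 32 separate segments, one per thread.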
However, the problem gets slightly tricky here, since the number of threads residing in a GPU block is limited to bd. Therefore, the coalesced data arrangement can be achieved by storing the first elements of the first bd vectors in consecutive order, followed by the first elements of the second bd vectors, and so on. The remaining vector elements are stored in a similar fashion, as shown in Fig. (b). If n (the number of vectors) is not a multiple of bd, the remaining data in the last block must be padded with some trivial value, e.g. 0.
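The rearrangement can be sketched on the host as follows. This is a minimal illustration that follows the address formula given in this answer, in which each group of bd vectors occupies a contiguous chunk of m × bd elements; `to_coalesced_layout` and its `pad` parameter are hypothetical names, and `float` is an assumed element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges n vectors of length m from the linear layout of Fig. (a)
// into a block-interleaved layout, padding the last block when n is not
// a multiple of bd. Hypothetical helper, for illustration only.
std::vector<float> to_coalesced_layout(const std::vector<float>& linear,
                                       std::size_t n, std::size_t m,
                                       std::size_t bd, float pad = 0.0f) {
    assert(bd > 0 && linear.size() == n * m);
    std::size_t num_blocks = (n + bd - 1) / bd;  // round up to whole blocks
    std::vector<float> out;
    out.reserve(num_blocks * bd * m);
    for (std::size_t bx = 0; bx < num_blocks; ++bx) {    // block index
        for (std::size_t i = 0; i < m; ++i) {            // component index
            for (std::size_t tx = 0; tx < bd; ++tx) {    // thread index in block
                std::size_t indx = bd * bx + tx;         // global vector index
                out.push_back(indx < n ? linear[m * indx + i] : pad);
            }
        }
    }
    return out;
}
```

For example, with n = 3, m = 2 and bd = 2, the linear data {10, 11, 20, 21, 30, 31} becomes {10, 20, 11, 21, 30, 0, 31, 0}, where the zeros are padding for the missing fourth vector.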
In the linear data storage in Fig. (a), component i (0 ≤ i < m) of vector indx (0 ≤ indx < n) is addressed by m × indx + i; the same component in the coalesced storage pattern in Fig. (b) is addressed as
(m × bd) × ixC + bd × ixB + ixA,
where ixC = floor[(m × indx + i) / (m × bd)] = bx, ixB = i and ixA = mod(indx, bd) = tx.
In summary, in the example of storing a number of vectors of size m, linear indexing is mapped to coalesced indexing according to:
m × indx + i → m × bd × bx + i × bd + tx
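The two addressing schemes can be written as small functions to check the mapping by hand. This is a sketch in plain C++ rather than kernel code; `linear_index` and `coalesced_index` are hypothetical names, and integer division stands in for the floor operation.

```cpp
#include <cassert>
#include <cstddef>

// Address of component i of vector indx in the linear layout of Fig. (a).
std::size_t linear_index(std::size_t indx, std::size_t i, std::size_t m) {
    assert(i < m);
    return m * indx + i;
}

// Address of the same component in the coalesced layout of Fig. (b).
std::size_t coalesced_index(std::size_t indx, std::size_t i,
                            std::size_t m, std::size_t bd) {
    assert(i < m && bd > 0);
    std::size_t ixC = (m * indx + i) / (m * bd);  // floor; equals bx because i < m
    std::size_t ixB = i;
    std::size_t ixA = indx % bd;                  // equals tx
    return (m * bd) * ixC + bd * ixB + ixA;
}
```

With m = 3 and bd = 4, neighbouring vectors 5 and 6 are 3 elements apart in the linear layout but only 1 element apart in the coalesced layout. Inside a kernel, bd, bx and tx would typically come from the built-ins blockDim.x, blockIdx.x and threadIdx.x.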
This data rearrangement can lead to a significantly higher effective bandwidth of GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).