Let's say I have several threads accessing memory at addresses A+0, A+4, A+8, A+12 (each consecutive thread accesses the next address). Such access is coalesced, right?
However if I
It's also worth noting that one of the main purposes of the L2 cache in an NVIDIA GPU is to collapse reads and coalesce writes. So if one warp was accessing
thread 0 -> A+0
thread 1 -> A+8
thread 2 -> A+16
thread 3 -> A+24
...
and another warp was accessing
thread 0 -> A+4
thread 1 -> A+12
thread 2 -> A+20
thread 3 -> A+28
...
then these two accesses will not coalesce inside the SM, since they come from different warps, but they will generally coalesce in the L2 cache, so that GPU memory will only be touched once.
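
To make the overlap concrete, here is a rough Python sketch that just counts which memory sectors each warp touches. It is only an address-arithmetic model, not a simulation of the hardware. The assumptions are mine and not from the post above: 32 threads per warp, 4-byte accesses, a 32-byte sector size, and a base address A that is sector-aligned (A = 0 here). The helper name `sectors_touched` is made up for illustration.

```python
SECTOR_BYTES = 32  # assumed sector size; check the documentation for your GPU
WARP_SIZE = 32     # threads per warp
WORD_BYTES = 4     # each thread reads or writes one 4-byte word


def sectors_touched(base, offset, stride):
    """Return the set of sector indices touched by one warp.

    Thread t accesses WORD_BYTES bytes starting at base + offset + t * stride.
    """
    touched = set()
    for t in range(WARP_SIZE):
        first = base + offset + t * stride
        last = first + WORD_BYTES - 1
        touched.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return touched


A = 0  # hypothetical sector-aligned base address

contiguous = sectors_touched(A, 0, 4)  # A+0, A+4, A+8, ...   (the question's pattern)
warp_a = sectors_touched(A, 0, 8)      # A+0, A+8, A+16, ...  (first warp above)
warp_b = sectors_touched(A, 4, 8)      # A+4, A+12, A+20, ... (second warp above)

print(len(contiguous), len(warp_a), len(warp_b), len(warp_a | warp_b))
```

Under these assumptions the contiguous pattern touches 4 sectors, while each strided warp touches 8 sectors but uses only half of the bytes in each. Both strided warps touch exactly the same 8 sectors, which is why combining them at the L2 level can avoid fetching that memory twice.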