The idea I have about coalescing memory accesses in CUDA is that threads in a warp should access contiguous memory addresses, as that will cause only a single memory transaction. Does the width of the physical memory bus to DRAM also matter here?
For purposes of coalescing, as you stated, you should focus on making the 32 threads in a warp access contiguous locations, preferably 32-byte or 128-byte aligned as well. Beyond that, don't worry about the physical address bus to the DRAM. The memory controller is composed of mostly independent partitions that are each 64 bits wide, and a coalesced access coming out of a warp will be satisfied as quickly as possible by that controller. A single coalesced access for a full warp (32 threads) reading an int or float requires 128 bytes to be retrieved anyway, i.e. multiple transactions on the physical bus to DRAM. And when you are operating in caching mode, you can't control the granularity of requests to global memory below 128 bytes at a time anyway.
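To make the pattern concrete, here is a minimal sketch (the kernel and variable names are my own, not from any particular codebase) of a fully coalesced access: consecutive threads touch consecutive floats, so each warp's load covers one contiguous 128-byte span.

```
// Each thread in a warp reads one consecutive float, so a full warp
// of 32 threads touches a single contiguous 32 * 4 = 128-byte span.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive addresses
    if (idx < n)
        out[idx] = in[idx];  // one coalesced load and one coalesced store per warp
}
```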
It's not possible to cause a single thread to request 48 bytes (or anything like that) in a single transaction. Even if, at the C code level, you think you are accessing a whole data structure at once, at the machine code level it gets converted to instructions that read 32 or 64 bits at a time.
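As an illustration (a hypothetical 48-byte struct of my own invention), copying one of these per thread does not become a single 48-byte request; the compiler breaks it into a sequence of smaller loads, typically 32- or 64-bit (possibly 128-bit vectorized, depending on alignment and compiler):

```
// Hypothetical 48-byte record used only to illustrate the point.
struct Record {
    float a[12];   // 12 * 4 = 48 bytes
};

__global__ void sumRecords(const Record *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        Record r = in[idx];   // compiled into multiple smaller loads, not one 48-byte transaction
        float s = 0.0f;
        for (int i = 0; i < 12; ++i)
            s += r.a[i];
        out[idx] = s;
    }
}
```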
If you feel that the caching restriction of 128 bytes at a time is penalizing your code, you can try running in uncached mode, which reduces the granularity of global memory requests to 32 bytes at a time. If you have a scattered (not well coalesced) access pattern, this option may give better performance.
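If I recall correctly, on the architectures where this applies you select uncached (cache-global-only) loads with the load caching modifier passed through to ptxas, something like the line below; check the tuning guide for your toolkit and GPU generation, since the behavior is architecture-dependent.

```
# Bypass L1 caching of global loads ("uncached"/cg mode) so global
# memory requests are serviced in 32-byte segments instead of 128-byte lines.
nvcc -Xptxas -dlcm=cg -o mykernel mykernel.cu
```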