When is padding for shared memory really required?
I am confused by two documents from NVIDIA. "CUDA Best Practices" explains that shared memory is organized in banks and that, in general, in 32-bit mode each 4-byte word is a bank (that is how I understood it). However, "Parallel Prefix Sum (Scan) with CUDA" (available here: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html ) goes into detail about how padding should be added to the scan algorithm because of bank conflicts. My problem is that the basic type for this algorithm, as presented, is float, and its size is 4 bytes. Thus each float is a bank and there is no bank conflict. So is my