Using maximum shared memory in CUDA
Question: I am unable to use more than 48 KB of shared memory (on a V100, CUDA 10.2).

I call

    cudaFuncSetAttribute(my_kernel, cudaFuncAttributePreferredSharedMemoryCarveout, cudaSharedmemCarveoutMaxShared);

before launching my_kernel for the first time. I use launch bounds and dynamic shared memory inside my_kernel:

    __global__ void __launch_bounds__(768, 1) my_kernel(...)
    {
        extern __shared__ float2 sh[];
        ...
    }

The kernel is launched like this:

    dim3 blk(32, 24); // 768 threads, as in __launch_bounds__
    my_kernel<<<grd, blk, 64