The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most NVIDIA GPUs of today.
I have an application in which I need to create an array th
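You can try the cudaFuncSetCacheConfig(nameOfKernel, cudaFuncCachePrefer{Shared, L1}) function.
On devices where L1 cache and shared memory are carved out of the same 64 KB of on-chip memory (compute capability 2.x, for example), the split works like this: if you prefer L1 to Shared, 48 KB goes to L1 and 16 KB goes to Shared; if you prefer Shared to L1, 48 KB goes to Shared and 16 KB goes to L1. Note that this is only a preference: the runtime may choose a different configuration if the kernel requires it.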
Usage:
cudaFuncSetCacheConfig(matrix_multiplication, cudaFuncCachePreferShared);
matrix_multiplication<<<grid, block>>>(bla, bla, bla);
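
If it helps, here is a fuller sketch of the same idea that should compile on its own with nvcc. The kernel scale_tile, the TILE constant and the buffer sizes are made up for illustration and have nothing to do with your matrix code; the only calls that matter are cudaGetDeviceProperties (to see how much shared memory per block the device reports) and cudaFuncSetCacheConfig (to request the larger shared-memory split before the launch). Treat it as a starting point under those assumptions, not as a tuned implementation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 256  // hypothetical block size, chosen only for this example

// Hypothetical kernel: stages one element per thread in shared memory,
// then writes back the doubled value.
__global__ void scale_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();  // called by every thread, outside the branch
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Report what the device says it offers per block.
    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties");
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d_in, *d_out;
    check(cudaMalloc(&d_in, bytes), "cudaMalloc d_in");
    check(cudaMalloc(&d_out, bytes), "cudaMalloc d_out");
    check(cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice), "copy to device");

    // Ask for the larger shared-memory split for this kernel.
    // This is a hint; the runtime may ignore it on some devices.
    check(cudaFuncSetCacheConfig(scale_tile, cudaFuncCachePreferShared),
          "cudaFuncSetCacheConfig");

    scale_tile<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost), "copy to host");
    printf("%s\n", h[0] == 2.0f && h[n - 1] == 2.0f ? "ok" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    free(h);
    return 0;
}
```

The cache configuration is set per kernel function, so it has to be called once for each kernel that needs the larger shared-memory split, before that kernel is launched.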