Note that this shared memory array is never written to, only read from.
As I have it, my shared memory gets initialized like:
__shared__ float TMshar
Use all threads in the block to write independent locations; it will probably be quicker.
Example assumes 1D threadblock/grid:
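#define SSIZE 2592

__shared__ float TMshared[SSIZE];

// Each thread copies elements lidx, lidx + blockDim.x, ... from TM
int lidx = threadIdx.x;
while (lidx < SSIZE) {
    TMshared[lidx] = TM[lidx];
    lidx += blockDim.x;
}
// Wait until the whole array is loaded before any thread reads it
__syncthreads();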
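For completeness, here is a rough sketch of how that loading loop might sit inside a full kernel, with a host-side check. The kernel name `stageAndReverse`, the helper `runStageCheck`, the output buffer, and the launch configuration (4 blocks of 256 threads) are all assumptions made for illustration; they are not from the question. The kernel writes the table back out in reverse order only so that each thread reads shared-memory elements that other threads loaded, which is the case where the `__syncthreads()` matters.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define SSIZE 2592

// Hypothetical kernel: stage the read-only table TM in shared memory, then
// write it out reversed (one copy per block) so the host can verify the load.
__global__ void stageAndReverse(const float *TM, float *out)
{
    __shared__ float TMshared[SSIZE];

    // Cooperative load: thread t copies elements t, t + blockDim.x, ...
    for (int i = threadIdx.x; i < SSIZE; i += blockDim.x)
        TMshared[i] = TM[i];
    __syncthreads();  // all of TMshared is populated past this point

    // Read elements that were (mostly) loaded by other threads.
    float *blockOut = out + (size_t)blockIdx.x * SSIZE;
    for (int i = threadIdx.x; i < SSIZE; i += blockDim.x)
        blockOut[i] = TMshared[SSIZE - 1 - i];
}

// Hypothetical host helper: returns true if every block produced the
// reversed table. The default launch configuration is an arbitrary choice.
bool runStageCheck(int blocks = 4, int threads = 256)
{
    std::vector<float> hTM(SSIZE), hOut((size_t)blocks * SSIZE, -1.0f);
    for (int i = 0; i < SSIZE; ++i) hTM[i] = (float)i;

    float *dTM = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dTM, SSIZE * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, hOut.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(dTM);
        return false;
    }
    cudaMemcpy(dTM, hTM.data(), SSIZE * sizeof(float), cudaMemcpyHostToDevice);

    stageAndReverse<<<blocks, threads>>>(dTM, dOut);
    cudaError_t err = cudaGetLastError();  // catches launch failures
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(dTM);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < blocks; ++b)
        for (int i = 0; i < SSIZE; ++i)
            if (hOut[(size_t)b * SSIZE + i] != hTM[SSIZE - 1 - i]) return false;
    return true;
}
```

Calling `runStageCheck()` from `main` should return true on a working device. The array takes 2592 * 4 = 10368 bytes of shared memory per block, so it fits in a static `__shared__` declaration.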