Here is my code:
int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum, threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks
Just to add to the previous answers: you can also query the maximum number of threads per block at runtime, so your code runs on other devices without hard-coding the thread count:
struct cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
cout<<"using "<
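As a rough sketch of how that queried limit could feed into the launch configuration: the host-side helpers below pick a square block side from a thread limit and round the block count up. The names squareBlockSide and blocksNeeded are made up for illustration. The limit is passed in as a plain int here so the snippet compiles without the CUDA toolkit; in real code I'd assume it comes from properties.maxThreadsPerBlock.

```cpp
// Hypothetical helpers, not part of the CUDA API.

// Largest power-of-two side s such that an s x s block stays within
// the per-block thread limit (e.g. cudaDeviceProp::maxThreadsPerBlock).
int squareBlockSide(int maxThreadsPerBlock) {
    int side = 1;
    while ((2 * side) * (2 * side) <= maxThreadsPerBlock) {
        side *= 2;
    }
    return side;
}

// Number of blocks needed to cover `extent` elements along one axis,
// rounding up so a partial last block is still launched.
int blocksNeeded(int extent, int blockSide) {
    return (extent + blockSide - 1) / blockSide;
}
```

With these, the block would be something like dim3 dimBlock(side, side) and the grid dim3 dimGrid(blocksNeeded(nWidth, side), blocksNeeded(nHeight, side)), where nHeight is an assumed name for the other image dimension. Because the grid is rounded up, the kernel still needs a bounds check on the last row and column of blocks. Using a power of two for the side is only one choice; it keeps the block size a multiple of the warp size once the side reaches 8.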