CUDA reduction - basics


Question


I'm trying to sum an array with this code and I am stuck. I probably need a "CUDA for dummies" tutorial, because I have spent so much time on such a basic operation and I still can't make it work.

Here is a list of things I don't understand or am unsure of:

  1. What number of blocks (dimGrid) should I use? I think it should be N/dimBlock.x/2 (N = length of the input array), because at the beginning of the kernel, data from two "blocks" of global memory are loaded and added into shared memory.

  2. In the original code there was blockSize. I replaced it with blockDim.x because I don't know how these variables differ. But when blockSize = blockDim.x, then gridSize = blockDim.x*2*gridDim.x doesn't make sense to me - gridSize will be greater than N. What is the difference between *Dim.x and *Size in the context of a 1D array?

  3. Main logic - in the kernel, each block sums 2*dimBlock (threads per block) numbers. When N = 262144 and dimBlock = 128, the kernel returns an array of 1024 partial sums. Then I run the kernel again and get 4 partial sums. Finally, in the last run, a single sum is returned, because the array is processed by a single block. (The arithmetic is spelled out right after this list.)

  4. I am summing a binary array. In the first run, I can use uchar4 for the input data. In the second and third runs, I will use int.
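
To spell out the arithmetic behind points 1 and 3 (just my own numbers written out):

// pass 1: 262144 inputs, each block consumes 2*128 of them -> 262144 / 256 = 1024 partial sums
// pass 2:   1024 partial sums reduced the same way         ->   1024 / 256 = 4 partial sums
// pass 3: a single block reduces the remaining 4 values    -> 1 final sum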

Please tell me what I am missing.

Thanks

__global__ void sum_reduction(uchar4* g_idata, int* g_odata, int N) {

    extern __shared__ int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + tid;
    unsigned int gridSize = blockDim.x*2*gridDim.x;

    // grid-stride loop: each thread adds up the four bytes of two uchar4 elements per trip
    while (i < N) {
        s_data[tid] += g_idata[i].x + g_idata[i+blockDim.x].x +
                g_idata[i].y + g_idata[i+blockDim.x].y +
                g_idata[i].z + g_idata[i+blockDim.x].z +
                g_idata[i].w + g_idata[i+blockDim.x].w;
        i += gridSize;
    }
    __syncthreads();

    // first reduction step in shared memory (assumes blockDim.x == 128): 128 -> 64 values
    if (tid < 64) {
        s_data[tid] += s_data[tid + 64];
    }
    __syncthreads();

    // last warp: warp-synchronous reduction through a volatile pointer
    if (tid < 32) {
        volatile int *s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }

    // thread 0 writes this block's partial sum
    if (tid == 0) {
        g_odata[blockIdx.x] = s_data[0];
    }
}


int main() {
    ...
    dim3 dimBlock(128);
    dim3 dimGrid(N/dimBlock.x);
    sum_reduction<<<dimGrid, dimBlock>>>(in, out, N);
    ...
}

Answer 1:


Calling the kernel like this fixes the problem. Your kernel declares extern __shared__ int s_data[], so the size of that dynamic shared-memory array has to be passed as the third argument of the launch configuration; without it, no shared memory is allocated for the array.

dim3 dimBlock(128);
dim3 dimGrid(N/dimBlock.x);
int smemSize = dimBlock.x * sizeof(int);
sum_reduction<<<dimGrid, dimBlock, smemSize>>>(in, out, N);    
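
With that in place, your point 3 just repeats the same reduction on the partial sums until one value is left. A rough sketch of the host side (sum_reduction_int is a hypothetical int-input variant of your kernel for the later passes, out2 is a hypothetical second device buffer of the same size as out, and error checking is omitted):

// after the launch above, out holds dimGrid.x partial sums (one int per block)
int  n   = dimGrid.x;
int* src = out;
int* dst = out2;                                 // hypothetical scratch buffer
while (n > 1) {
    // each block again consumes 2*dimBlock.x values
    int blocks = (n + dimBlock.x*2 - 1) / (dimBlock.x*2);
    sum_reduction_int<<<blocks, dimBlock, smemSize>>>(src, dst, n);
    int* tmp = src; src = dst; dst = tmp;        // ping-pong so a launch never reads and writes the same buffer
    n = blocks;
}
// src[0] now holds the total; note that the int kernel must guard its
// g_idata[i + blockDim.x] read once n drops below 2*blockDim.x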



Answer 2:


Okay, I think you need to start fresh. Take a look at this step-by-step guide from NVIDIA on reduction.
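
To your point 2: if that guide is the "Optimizing Parallel Reduction in CUDA" material I have in mind, blockSize there is not a built-in variable but a compile-time template parameter chosen to match blockDim.x at launch, and gridSize is the stride of the grid-wide loop, not an element count, so it being larger than N is fine. A trimmed sketch of that pattern (reduce_sketch and its launch line are illustrative, not the guide's exact code):

// blockSize is fixed at compile time, so the compiler can prune the
// unreachable steps of the unrolled tree reduction in the real kernel.
template <unsigned int blockSize>
__global__ void reduce_sketch(const int* g_idata, int* g_odata, unsigned int n) {
    extern __shared__ int sdata[];

    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;   // elements covered by the whole grid per loop trip

    sdata[tid] = 0;
    while (i < n) {                                      // runs once if the grid already covers the input
        sdata[tid] += g_idata[i];
        if (i + blockSize < n)                           // guard the second read near the end of the array
            sdata[tid] += g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();

    // tree reduction in shared memory (the guide unrolls this using blockSize)
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// launched with the block size baked in, e.g.:
// reduce_sketch<128><<<dimGrid, dimBlock, smemSize>>>(d_in, d_out, n);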



Source: https://stackoverflow.com/questions/11101539/cuda-reduction-basics
