CUDA reduction - basics


Question


I'm trying to sum an array with this code and I am stuck. I probably need a "CUDA for dummies" tutorial, because I have spent so much time on such a basic operation and I still can't make it work.

Here is a list of things I don't understand or am unsure of:

  1. What number of blocks (dimGrid) should I use? I think it should be N/dimBlock.x/2 (N = length of the input array), because at the beginning of the kernel, data from two "blocks" of global memory are loaded and added into shared memory.

  2. In the original code there was blockSize. I replaced it with blockDim.x because I don't know how these variables differ. But when blockSize = blockDim.x, then gridSize = blockDim.x*2*gridDim.x doesn't make sense to me - gridSize will be greater than N. What is the difference between *Dim.x and *Size in the context of a 1D array?

  3. Main logic - in the kernel, each block sums 2*dimBlock (threads per block) numbers. When N = 262144 and dimBlock = 128, the kernel returns an array of 1024 partial sums. Then I run the kernel again and get 4 partial sums. Finally, in the last run, a single sum is returned, because the array is processed by a single block. (The arithmetic is spelled out right after this list.)

  4. I am summing a binary array. In the first run, I can use uchar4 for the input data. In the second and third runs, I will use int.
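
To spell out the arithmetic behind points 1 and 3 (just my own numbers written out):

// pass 1: 262144 inputs, each block consumes 2*128 of them -> 262144 / 256 = 1024 partial sums
// pass 2:   1024 partial sums reduced the same way         ->   1024 / 256 = 4 partial sums
// pass 3: a single block reduces the remaining 4 values    -> 1 final sum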

Please tell me what I am missing.

Thanks

__global__ void sum_reduction(uchar4* g_idata, int* g_odata, int N) {

    extern __shared__ int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + tid;
    unsigned int gridSize = blockDim.x*2*gridDim.x;

    // grid-stride loop: each thread adds up the four bytes of two uchar4 elements per trip
    while (i < N) {
        s_data[tid] += g_idata[i].x + g_idata[i+blockDim.x].x +
                g_idata[i].y + g_idata[i+blockDim.x].y +
                g_idata[i].z + g_idata[i+blockDim.x].z +
                g_idata[i].w + g_idata[i+blockDim.x].w;
        i += gridSize;
    }
    __syncthreads();

    // first reduction step in shared memory (assumes blockDim.x == 128): 128 -> 64 values
    if (tid < 64) {
        s_data[tid] += s_data[tid + 64];
    }
    __syncthreads();

    // last warp: warp-synchronous reduction through a volatile pointer
    if (tid < 32) {
        volatile int *s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }

    // thread 0 writes this block's partial sum
    if (tid == 0) {
        g_odata[blockIdx.x] = s_data[0];
    }
}


int main() {
    ...
    dim3 dimBlock(128);
    dim3 dimGrid(N/dimBlock.x);
    sum_reduction<<<dimGrid, dimBlock>>>(in, out, N);
    ...
}

Answer 1:


Calling the kernel like this fixes the problem. Your kernel declares extern __shared__ int s_data[], so the size of that dynamic shared-memory array has to be passed as the third argument of the launch configuration; without it, no shared memory is allocated for the array.

dim3 dimBlock(128);
dim3 dimGrid(N/dimBlock.x);
int smemSize = dimBlock.x * sizeof(int);
sum_reduction<<<dimGrid, dimBlock, smemSize>>>(in, out, N);    
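
With that in place, your point 3 just repeats the same reduction on the partial sums until one value is left. A rough sketch of the host side (sum_reduction_int is a hypothetical int-input variant of your kernel for the later passes, out2 is a hypothetical second device buffer of the same size as out, and error checking is omitted):

// after the launch above, out holds dimGrid.x partial sums (one int per block)
int  n   = dimGrid.x;
int* src = out;
int* dst = out2;                                 // hypothetical scratch buffer
while (n > 1) {
    // each block again consumes 2*dimBlock.x values
    int blocks = (n + dimBlock.x*2 - 1) / (dimBlock.x*2);
    sum_reduction_int<<<blocks, dimBlock, smemSize>>>(src, dst, n);
    int* tmp = src; src = dst; dst = tmp;        // ping-pong so a launch never reads and writes the same buffer
    n = blocks;
}
// src[0] now holds the total; note that the int kernel must guard its
// g_idata[i + blockDim.x] read once n drops below 2*blockDim.x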



Answer 2:


Okay, I think you need to start fresh. Take a look at this step-by-step guide from NVIDIA on reduction.
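
To your point 2: if that guide is the "Optimizing Parallel Reduction in CUDA" material I have in mind, blockSize there is not a built-in variable but a compile-time template parameter chosen to match blockDim.x at launch, and gridSize is the stride of the grid-wide loop, not an element count, so it being larger than N is fine. A trimmed sketch of that pattern (reduce_sketch and its launch line are illustrative, not the guide's exact code):

// blockSize is fixed at compile time, so the compiler can prune the
// unreachable steps of the unrolled tree reduction in the real kernel.
template <unsigned int blockSize>
__global__ void reduce_sketch(const int* g_idata, int* g_odata, unsigned int n) {
    extern __shared__ int sdata[];

    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;   // elements covered by the whole grid per loop trip

    sdata[tid] = 0;
    while (i < n) {                                      // runs once if the grid already covers the input
        sdata[tid] += g_idata[i];
        if (i + blockSize < n)                           // guard the second read near the end of the array
            sdata[tid] += g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();

    // tree reduction in shared memory (the guide unrolls this using blockSize)
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// launched with the block size baked in, e.g.:
// reduce_sketch<128><<<dimGrid, dimBlock, smemSize>>>(d_in, d_out, n);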



Source: https://stackoverflow.com/questions/11101539/cuda-reduction-basics
