Question
I have the following code:
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    double phi_cap1, phi_cap2;

    if (i < M) {
        for (int m = 0; m < (2 * K + 1); m++) {
            [calculate phi_cap1];
            for (int n = 0; n < (2 * K + 1); n++) {
                [calculate phi_cap2];
                [calculate phi_cap = phi_cap1 * phi_cap2];
                [use phi_cap];
            }
        }
    }
}
I would like to use Dynamic Parallelism on a Kepler K20 card to dispatch the computation of phi_cap1 and phi_cap2 in parallel to a bunch of threads, to reduce the computation time. K=6 in my code, so I'm launching a single child block of 13x13 threads.
Following the CUDA Dynamic Parallelism Programming Guide, I'm allocating in global memory a matrix phi_cap of 169 elements (the products of phi_cap1 and phi_cap2), which is needed to exchange data with the child kernel. Indeed, quoting the guide,
As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap.
I then ended up with the following code:
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    dim3 dimBlock(2 * K + 1, 2 * K + 1);
    dim3 dimGrid(1, 1);

    if (i < M) {
        // Per-thread scratch buffer in global memory for the child kernel
        double* phi_cap;
        cudaMalloc((void**)&phi_cap, sizeof(double) * (2 * K + 1) * (2 * K + 1));

        child_kernel<<<dimGrid, dimBlock>>>(cc_diff1, cc_diff2, phi_cap);
        cudaDeviceSynchronize();   // child launches are asynchronous; wait before reading phi_cap

        for (int m = 0; m < (2 * K + 1); m++) {
            for (int n = 0; n < (2 * K + 1); n++) {
                [use phi_cap];
            }
        }

        cudaFree(phi_cap);
    }
}
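(For context, a hypothetical sketch of what the child kernel might look like; the actual phi_cap1/phi_cap2 computations are elided in the question, so compute_phi_cap1 and compute_phi_cap2 below are placeholder device functions, not code from the original.)

__device__ double compute_phi_cap1(double cc_diff1, int m);   // placeholder
__device__ double compute_phi_cap2(double cc_diff2, int n);   // placeholder

__global__ void child_kernel(double cc_diff1, double cc_diff2, double* phi_cap)
{
    int m = threadIdx.x;   // 0 .. 2K
    int n = threadIdx.y;   // 0 .. 2K

    // Each of the (2K+1) x (2K+1) threads writes one product to the
    // global-memory buffer, which the parent thread reads back after
    // synchronizing on the child launch.
    phi_cap[m * (2 * K + 1) + n] = compute_phi_cap1(cc_diff1, m) * compute_phi_cap2(cc_diff2, n);
}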
The problem is that the first routine takes 5ms to run, while the second routine, even with the child_kernel launch commented out, takes 23ms, with practically all of the time spent in the cudaMalloc API.
Since dynamic parallelism often requires allocating memory to exchange data with the child kernels, and the only option seems to be global memory, which takes this long to allocate, it seems to me that data exchange is a serious bottleneck for the usefulness of dynamic parallelism, unless there is a way to circumvent the global-memory allocation issue.
The question, then, is: is there any workaround to the above issue, namely the long time taken to allocate global memory from within a kernel? Thanks.
SOLUTION PROPOSED IN THE COMMENTS
Allocate the required global memory from outside the parent kernel. I have verified that this is much faster.
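For reference, a minimal sketch of the host-side version; the extra d_phi_cap kernel parameter and the launch-configuration names are assumptions, not part of the original code:

// Allocate the whole phi_cap workspace once on the host, before launching
// the parent kernel: one (2K+1)^2 tile of doubles per parent thread.
double* d_phi_cap = NULL;
cudaMalloc((void**)&d_phi_cap, sizeof(double) * M * (2 * K + 1) * (2 * K + 1));

// The parent kernel takes the workspace as an extra parameter; inside it,
// thread i passes d_phi_cap + i * (2*K+1)*(2*K+1) to its child kernel
// instead of calling cudaMalloc itself.
interpolation<<<dimGrid, dimBlock>>>(data, result, x, y, N1, N2, M, d_phi_cap);

cudaFree(d_phi_cap);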
Answer 1:
You are calling cudaMalloc from each thread where i < M, which means that you are making M cudaMalloc calls.
The bigger M is, the worse it is going to get.
Instead, you could make a single cudaMalloc call from the first thread of the block, allocating M times the size you used before (actually, in your case you should allocate a bit more, so that each memory block is properly aligned). After that, sync the threads, and you can start your child kernels with a correctly computed phi_cap address for each child kernel, as in the sketch below.
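A minimal sketch of that scheme; the buffer here is sized per block (blockDim.x tiles, which equals M tiles when a single block covers all M threads), and block_buf and tile are invented names:

__global__ void interpolation(/* same parameters as before */ int M)
{
    const int tile = (2 * K + 1) * (2 * K + 1);
    __shared__ double* block_buf;          // one pointer shared by the whole block

    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (threadIdx.x == 0)                  // one allocation for the whole block
        cudaMalloc((void**)&block_buf, sizeof(double) * tile * blockDim.x);
    __syncthreads();                       // everyone waits for the pointer

    if (i < M) {
        double* phi_cap = block_buf + threadIdx.x * tile;   // this thread's slice
        // cc_diff1/cc_diff2 are computed in the elided part, as in the question
        child_kernel<<<dim3(1, 1), dim3(2 * K + 1, 2 * K + 1)>>>(cc_diff1, cc_diff2, phi_cap);
        cudaDeviceSynchronize();           // wait for the child before reading phi_cap
        [use phi_cap as before];
    }

    __syncthreads();                       // make sure no thread still needs the buffer
    if (threadIdx.x == 0)
        cudaFree(block_buf);
}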
Alternatively (if your specific situation allows you to hold on to the allocation between kernel calls), you could allocate the memory once outside of the kernel and reuse it. That would be a lot quicker. If M varies between kernel calls, you could allocate as much as you would need for the biggest M.
Source: https://stackoverflow.com/questions/14855408/dynamic-programming-in-cuda-global-memory-allocations-to-exchange-data-with-chi