CUDA/C - Using malloc in kernel functions gives strange results

Submitted on 2021-01-18 07:30:08

Question


I'm new to CUDA/C and new to stack overflow. This is my first question.

I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected. I've read that using malloc() in a kernel can hurt performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; later I'll need to allocate more complex structs.

In my main I used cudaMalloc() to allocate the space for the array of int *, and then in the kernel function I used malloc() in every thread to allocate the inner array for that thread's index of the outer array. I then used a second kernel with a single thread to check the result, but it doesn't always work.

Here's the main code:

#define N_CELLE 1024*2
#define L_CELLE 512

extern "C" {

int main(int argc, char **argv) {
  int *result = (int *)malloc(sizeof(int));
  int *d_result;
  int size_numbers = N_CELLE * sizeof(int *);
  int **d_numbers;

  cudaMalloc((void **)&d_numbers, size_numbers);
  cudaMalloc((void **)&d_result, sizeof(int *));

  kernel_one<<<2, 1024>>>(d_numbers);
  cudaDeviceSynchronize();
  kernel_two<<<1, 1>>>(d_numbers, d_result);

  cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

  printf("%d\n", *result);

  cudaFree(d_numbers);
  cudaFree(d_result);
  free(result);
}

}

I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I pasted it anyway since I don't know whether it's relevant.

This is kernel_one code:

__global__ void kernel_one(int **d_numbers) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
  for(int j=0; j<L_CELLE;j++)
    d_numbers[i][j] = 1;
}

And this is kernel_two code:

__global__ void kernel_two(int **d_numbers, int *d_result) {
  int temp = 0;
  for(int i=0; i<N_CELLE; i++) {
    for(int j=0; j<L_CELLE;j++)
      temp += d_numbers[i][j];     
  }
  *d_result = temp;
}

Everything works fine (i.e. the count is correct) as long as I allocate fewer than 1024*2*512 ints in total in device memory. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers. Any idea what the problem could be? Thanks, everyone!


Answer 1:


In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
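As a hedged illustration of that host-side error checking (the macro name here is my own, not from the original code), every runtime API call and kernel launch can be wrapped like this:

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: checks the status returned by a CUDA runtime call
// and aborts with a diagnostic on failure.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage sketch: wrap each runtime call, and check kernel launches
// with cudaGetLastError() followed by a synchronize.
//   CUDA_CHECK(cudaMalloc((void **)&d_numbers, size_numbers));
//   kernel_one<<<2, 1024>>>(d_numbers);
//   CUDA_CHECK(cudaGetLastError());
//   CUDA_CHECK(cudaDeviceSynchronize());
```

With this in place, the heap-exhaustion failure described above surfaces as an explicit error instead of silently corrupted results.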

The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:

 size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
 cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);

to your host code before any other API calls, should solve the problem.
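Complementing the heap-size fix, the in-kernel allocation itself can be guarded: device-side malloc() returns NULL when the runtime heap is exhausted. A sketch of the question's kernel_one with that check added (same N_CELLE/L_CELLE assumptions as above):

```cuda
__global__ void kernel_one(int **d_numbers) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  d_numbers[i] = (int *)malloc(L_CELLE * sizeof(int));
  if (d_numbers[i] == NULL)  // heap exhausted: bail out instead of
    return;                  // writing through a null pointer
  for (int j = 0; j < L_CELLE; j++)
    d_numbers[i][j] = 1;
}
```

A later kernel reading d_numbers would then need to tolerate NULL entries, but the failure mode becomes detectable rather than producing "random" sums.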




Answer 2:


I don't know anything about CUDA but these are severe bugs:

  • You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
  • &d_numbers gives the address of a pointer-to-pointer, which is wrong here: it is of type int ***.

Both of the above bugs result in undefined behavior. If your program somehow seems to work under some conditions, that's just pure (bad) luck.



Source: https://stackoverflow.com/questions/44901330/cuda-c-using-malloc-in-kernel-functions-gives-strange-results
