Question
I'm trying to implement a linked list in a CUDA application to model a growing network. To do so, I'm using malloc inside a __device__ function, aiming to allocate the memory in global memory.
The code is:
void __device__ insereviz(Vizinhos **lista, Nodo *novizinho, int *Gteste)
{
    Vizinhos *vizinho;

    // Allocate a new list entry on the device heap
    vizinho = (Vizinhos *)malloc(sizeof(Vizinhos));

    // Prepend the new entry to the neighbour list
    vizinho->viz = novizinho;
    vizinho->proxviz = *lista;
    *lista = vizinho;

    // Increment the node's degree counter
    novizinho->k = novizinho->k + 1;
}
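For reference, the question doesn't show the struct definitions, but from the usage above they would look roughly like this (the field sets are an assumption):

// Assumed layout (not shown in the question): a network node and a
// singly linked list of its neighbours.
typedef struct Nodo {
    int k;                      // degree counter, incremented on insertion
    // ... other per-node fields ...
} Nodo;

typedef struct Vizinhos {
    Nodo *viz;                  // the neighbouring node
    struct Vizinhos *proxviz;   // next entry in the neighbour list
} Vizinhos;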
After a certain number of allocated elements (around 90000) my program returns "unknown error". At first I thought it was a memory constraint, but I checked nvidia-smi and got:
+------------------------------------------------------+
| NVIDIA-SMI 331.38     Driver Version: 331.38         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770     Off  | 0000:01:00.0     N/A |                  N/A |
| 41%   38C  N/A     N/A /  N/A |    159MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
So it doesn't seem to be a memory problem, unless malloc is allocating inside shared memory. To test this I tried to run two networks in separate blocks, and I still hit a limit on the number of structures I was able to allocate. But when I run two instances of the same program with a smaller number of structures, they both finish without error.
I also tried cuda-memcheck and got:
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
========= at 0x000001b0 in /work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:164:insereviz(neighbor**, node*, int*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x00000000 is out of bounds
========= Device Frame:/work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:142:insereno(int, int, node**, node**, int*) (insereno(int, int, node**, node**, int*) : 0x648)
========= Device Frame:/work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:111:fazrede(node**, int, int, int, int*) (fazrede(node**, int, int, int, int*) : 0x4b8)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138281]
========= Host Frame:gpu_testamalloc5 [0x1bd48]
========= Host Frame:gpu_testamalloc5 [0x3b213]
========= Host Frame:gpu_testamalloc5 [0x2fe3]
========= Host Frame:gpu_testamalloc5 [0x2e39]
========= Host Frame:gpu_testamalloc5 [0x2e7f]
========= Host Frame:gpu_testamalloc5 [0x2c2f]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xfd) [0x1eead]
========= Host Frame:gpu_testamalloc5 [0x2829]
Is there any restriction on the kernel launch, or is there something I'm missing? How can I check it?
Thank you,
Ricardo
Answer 1:
The most likely reason is that you are running out of space on the "device heap". This defaults to 8MB, but you can change it.
Referring to the documentation, we see that device malloc allocates out of the device heap.
If an error occurs, malloc returns a NULL pointer. It's good practice to test for this NULL pointer in device code (and in host code too; it's no different from host malloc in this respect). If you get a NULL pointer, you have run out of device heap space.
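As a minimal sketch, the function from the question could check the return value like this (using the otherwise-unused Gteste parameter as a hypothetical error flag; how you report the failure is up to you):

void __device__ insereviz(Vizinhos **lista, Nodo *novizinho, int *Gteste)
{
    Vizinhos *vizinho = (Vizinhos *)malloc(sizeof(Vizinhos));

    // Device malloc returns NULL when the device heap is exhausted
    if (vizinho == NULL) {
        *Gteste = 1;  // hypothetical: flag the failure so the host can detect it
        return;
    }

    vizinho->viz = novizinho;
    vizinho->proxviz = *lista;
    *lista = vizinho;
    novizinho->k = novizinho->k + 1;
}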
As indicated in the documentation, the size of the device heap can be adjusted before your kernel call using the cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size) runtime API function.
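For example, a host-side sketch (the 128 MB figure is just an assumption; size it to your workload):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Must be called before the first kernel launch that uses device malloc
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                         128 * 1024 * 1024);  // 128 MB
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to confirm it took effect
    size_t heapsize = 0;
    cudaDeviceGetLimit(&heapsize, cudaLimitMallocHeapSize);
    printf("device malloc heap size: %zu bytes\n", heapsize);

    // ... launch kernels that call malloc in __device__ code ...
    return 0;
}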
If you ignore all this and attempt to use the returned NULL pointer anyway, you'll get invalid accesses in device code, like this:
========= Address 0x00000000 is out of bounds
Source: https://stackoverflow.com/questions/23916093/unknown-error-while-using-dynamic-allocation-inside-device-function-in-cud