CUDA: allocation of an array of structs inside a struct

I have these structs:

typedef struct neuron
{
    float* weights;
    int    n_weights;
} Neuron;


typedef struct neurallayer
{
    Neuron* neurons;
    int     n_neurons;
    int     act_function;
} NLayer;

An "NLayer" struct can contain an arbitrary number of "Neuron" structs.

I tried to allocate an 'NLayer' struct with 5 'Neurons' from the host in this way:

NLayer* nL;
int i;
int tmp=9;
cudaMalloc((void**)&nL,sizeof(NLayer));
cudaMalloc((void**)&nL->neurons,6*sizeof(Neuron));
for(i=0;i<5;i++)
    cudaMemcpy(&nL->neurons[i].n_weights,&tmp,sizeof(int),cudaMemcpyHostToDevice);

...then I tried to modify the "nL->neurons[0].n_weights" variable with this kernel:

__global__ void test(NLayer* n)
{
    n->neurons[0].n_weights=121;
}

but at compile time nvcc emits this warning for the only line of the kernel:

Warning: Cannot tell what pointer points to, assuming global memory space

and when the kernel finishes its work, the struct becomes unreachable.

I'm very probably doing something wrong during the allocation. Can someone help me? Thanks very much, and sorry for my English! :)

UPDATE:

Thanks to aland, I've modified my code and created this function, which should allocate an instance of the struct "NLayer":

NLayer* setNLayer(int numNeurons,int weightsPerNeuron,int act_fun)
{
    int i;
    NLayer  h_layer;
    NLayer* d_layer;
    float*  d_weights;

    //SET THE LAYER VARIABLE OF THE HOST NLAYER
    h_layer.act_function=act_fun;
    h_layer.n_neurons=numNeurons;
    //ALLOCATING THE DEVICE NLAYER
    if(cudaMalloc((void**)&d_layer,sizeof(NLayer))!=cudaSuccess)
        puts("ERROR: Unable to allocate the Layer");
    //ALLOCATING THE NEURONS ON THE DEVICE
    if(cudaMalloc((void**)&h_layer.neurons,numNeurons*sizeof(Neuron))!=cudaSuccess)
        puts("ERROR: Unable to allocate the Neurons of the Layer");
    //COPYING THE HOST NLAYER TO THE DEVICE
    if(cudaMemcpy(d_layer,&h_layer,sizeof(NLayer),cudaMemcpyHostToDevice)!=cudaSuccess)
                puts("ERROR: Unable to copy the data layer onto the device");

    for(i=0;i<numNeurons;i++)
    {
        //ALLOCATING THE WEIGHTS' ARRAY ON THE DEVICE
        cudaMalloc((void**)&d_weights,weightsPerNeuron*sizeof(float));
        //COPYING ITS POINTER AS PART OF THE i-TH NEURON STRUCT
        if(cudaMemcpy(&d_layer->neurons[i].weights,&d_weights,sizeof(float*),cudaMemcpyHostToDevice)!=cudaSuccess)
                puts("Error: unable to copy weights' pointer to the device");
    }


    //RETURN THE DEVICE POINTER
    return d_layer;
}

and I call that function from main like this (the kernel "test" is declared earlier):

int main()
{
    NLayer* nL;
    int h_tmp1;
    float h_tmp2;

    nL=setNLayer(10,12,13);
    test<<<1,1>>>(nL);
    if(cudaMemcpy(&h_tmp1,&nL->neurons[0].n_weights,sizeof(float),cudaMemcpyDeviceToHost)!=cudaSuccess);
        puts("ERROR!!");
    printf("RESULT:%d",h_tmp1);

}

When I compile this code, the compiler shows the warning, and when I run the program it prints:

Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
ERROR!!
RESULT:1

The last error does not appear if I comment out the kernel call.

Where am I going wrong? I don't know what to do. Thanks for your help!

The problem is here:

cudaMalloc((void**)&nL,sizeof(NLayer));
cudaMalloc((void**)&nL->neurons,6*sizeof(Neuron));

In the first line, nL is set to point to a structure in global memory on the device. Therefore, in the second line, the first argument to cudaMalloc is an address residing on the GPU, and cudaMalloc tries to write to that address from the host, which is undefined behaviour (on my test system it causes a segfault; in your case, though, something more subtle happens).

The correct way to do what you want is to first create the structure in host memory, fill it with data, and then copy it to the device, like this:

NLayer* nL;
NLayer h_nL;
int i;
int tmp=9;
// Allocate data on device
cudaMalloc((void**)&nL, sizeof(NLayer));
cudaMalloc((void**)&h_nL.neurons, 6*sizeof(Neuron));
// Copy nlayer with pointers to device
cudaMemcpy(nL, &h_nL, sizeof(NLayer), cudaMemcpyHostToDevice);

Also, don't forget to always check the return values of CUDA routines for errors.
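As an illustration of the error-checking advice, here is a minimal sketch of a checking helper. The names cudaOk, allocFloats and checkKernel are hypothetical and not part of the CUDA API or the original post; only cudaMalloc, cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are real runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API function): prints a message when a
// runtime call failed and returns true only on success.
static bool cudaOk(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Example use: allocate n floats on the device and report any failure.
static bool allocFloats(float** d_ptr, size_t n)
{
    return cudaOk(cudaMalloc((void**)d_ptr, n * sizeof(float)), "cudaMalloc");
}

// Kernel launches return no status themselves, so check afterwards:
// cudaGetLastError() reports launch errors, cudaDeviceSynchronize()
// reports errors raised while the kernel was running.
static bool checkKernel(const char* name)
{
    return cudaOk(cudaGetLastError(), name) &&
           cudaOk(cudaDeviceSynchronize(), name);
}
```

Calling checkKernel("test") right after test<<<1,1>>>(nL) would surface an invalid device access inside the kernel instead of letting a later cudaMemcpy fail for no obvious reason.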

UPDATE

In the second version of your code:

cudaMemcpy(&d_layer->neurons[i].weights,&d_weights,...)

you are again dereferencing a device pointer (d_layer) on the host. Instead, you should use

cudaMemcpy(&h_layer.neurons[i].weights,&d_weights,sizeof(float*),cudaMemcpyHostToDevice);

Here you take h_layer (a host structure) and read its member h_layer.neurons, which is a pointer to device memory. Then you do some pointer arithmetic on it (&h_layer.neurons[i].weights). No access to device memory is needed to compute this address.
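Putting the points above together, one possible way to write the whole allocation is sketched below. It is a variation rather than the asker's exact code: instead of copying each weights pointer separately, it stages all Neuron entries in a host array and copies them in one cudaMemcpy. The helpers readNWeights and runDemo are hypothetical names added for illustration, and cleanup on failure is left out to keep the sketch short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef struct neuron {
    float* weights;
    int    n_weights;
} Neuron;

typedef struct neurallayer {
    Neuron* neurons;
    int     n_neurons;
    int     act_function;
} NLayer;

// Same kernel as in the question.
__global__ void test(NLayer* n)
{
    n->neurons[0].n_weights = 121;
}

// Builds the layer in host memory first, then copies it to the device.
// Returns a device pointer, or NULL on failure (no cleanup on failure).
NLayer* setNLayer(int numNeurons, int weightsPerNeuron, int act_fun)
{
    NLayer  h_layer;
    NLayer* d_layer = NULL;
    bool    ok = true;

    // Host staging array: it stores device pointers but never dereferences them.
    Neuron* h_neurons = (Neuron*)malloc(numNeurons * sizeof(Neuron));
    if (h_neurons == NULL) return NULL;

    for (int i = 0; i < numNeurons; i++) {
        h_neurons[i].n_weights = weightsPerNeuron;
        h_neurons[i].weights   = NULL;
        ok = ok && cudaMalloc((void**)&h_neurons[i].weights,
                              weightsPerNeuron * sizeof(float)) == cudaSuccess;
    }

    h_layer.n_neurons    = numNeurons;
    h_layer.act_function = act_fun;
    h_layer.neurons      = NULL;
    // h_layer lives in host memory, so taking &h_layer.neurons is safe.
    ok = ok && cudaMalloc((void**)&h_layer.neurons,
                          numNeurons * sizeof(Neuron)) == cudaSuccess;
    ok = ok && cudaMemcpy(h_layer.neurons, h_neurons, numNeurons * sizeof(Neuron),
                          cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_layer, sizeof(NLayer)) == cudaSuccess;
    ok = ok && cudaMemcpy(d_layer, &h_layer, sizeof(NLayer),
                          cudaMemcpyHostToDevice) == cudaSuccess;

    free(h_neurons);
    return ok ? d_layer : NULL;
}

// Reads neurons[idx].n_weights back to the host; returns -1 on failure.
int readNWeights(const NLayer* d_layer, int idx)
{
    NLayer h_layer;
    int    value = -1;
    // First fetch the layer struct so the neurons pointer is known on the host.
    if (cudaMemcpy(&h_layer, d_layer, sizeof(NLayer),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;
    // Address arithmetic only: the device pointer itself is held in host memory.
    if (cudaMemcpy(&value, &h_layer.neurons[idx].n_weights, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return -1;
    return value;
}

// Allocates a layer, runs the kernel, and returns neurons[0].n_weights.
int runDemo(void)
{
    NLayer* nL = setNLayer(10, 12, 13);
    if (nL == NULL) return -1;
    test<<<1, 1>>>(nL);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    return readNWeights(nL, 0);
}
```

Note that the main in the question has the same host-side dereference: &nL->neurons[0].n_weights reads nL->neurons through a device pointer on the host. readNWeights avoids this by copying the NLayer struct back first. A complete program would also release the weights arrays, the neurons array and the layer with cudaFree.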

It all depends on the GPU card you're using. Fermi cards use a unified address space for shared and global memory, while pre-Fermi cards don't.

For the pre-Fermi case, you don't know if the address should be shared or global. The compiler can usually figure this out, but there are cases where it can't. When a pointer to shared memory is required, you usually take an address of a shared variable and the compiler can recognise this. The message "assuming global" will appear when this is not explicitly defined.
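To make the ambiguity concrete, here is a small sketch of a kernel in which a pointer can refer to either shared or global memory depending on a runtime value, so the memory space cannot be determined at compile time. The kernel pick and the host wrapper runPick are hypothetical names; whether a given nvcc version prints the warning for exactly this code is an assumption, not something stated in the thread.

```cuda
#include <cuda_runtime.h>

// p points to shared memory or to global memory depending on useShared,
// which is only known at run time.
__global__ void pick(const float* g_in, float* g_out, int useShared)
{
    __shared__ float s_buf[32];
    s_buf[threadIdx.x] = 2.0f * g_in[threadIdx.x];
    __syncthreads();

    const float* p = useShared ? s_buf : g_in;
    g_out[threadIdx.x] = p[threadIdx.x];
}

// Runs the kernel on the inputs 1..32 and returns output element 3,
// or -1.0f if any CUDA call fails.
float runPick(int useShared)
{
    float  h_in[32], h_out[32];
    float *d_in = NULL, *d_out = NULL;
    for (int i = 0; i < 32; i++) h_in[i] = (float)(i + 1);

    if (cudaMalloc((void**)&d_in, sizeof(h_in)) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void**)&d_out, sizeof(h_out)) != cudaSuccess) return -1.0f;
    if (cudaMemcpy(d_in, h_in, sizeof(h_in),
                   cudaMemcpyHostToDevice) != cudaSuccess) return -1.0f;

    pick<<<1, 32>>>(d_in, d_out, useShared);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[3] : -1.0f;
}
```

On a device with a unified address space, both branches resolve correctly: runPick(1) reads the doubled value from shared memory and runPick(0) reads the original value from global memory.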

If you are using a GPU that has a compute capability of 2.x or higher, it should work with the -arch=sm_20 compiler flag.

Source: https://stackoverflow.com/questions/11874667/cuda-allocation-of-an-array-of-structs-inside-a-struct
