I am currently going through the tutorial examples on http://code.google.com/p/stanford-cs193g-sp2010/ to learn CUDA. The code which demostrates __global__ func
This is simply a horrible, horrible API design. The problem with passing double-pointers for an allocation function that obtains abstract (void *) memory is that you have to make a temporary variable of type void * to hold the result, then assign it into the real pointer of the correct type you want to use. Casting, as in (void**)&device_array, is invalid C and results in undefined behavior. You should simply write a wrapper function that behaves like normal malloc and returns a pointer, as in:
void *fixed_cudaMalloc(size_t len)
{
void *p;
if (cudaMalloc(&p, len) == success_code) return p;
return 0;
}