I am currently going through the tutorial examples on http://code.google.com/p/stanford-cs193g-sp2010/ to learn CUDA. The code which demonstrates __global__ func
The problem is that the function has to return two values: an error code and, when that code indicates success, a pointer to the allocated memory. In C only one of them can be the actual return value, so the other has to be passed back through a pointer argument. That leaves two choices: return the pointer and write the error code through a pointer to int, or return the error code and write the address through a pointer to pointer. One solution is as good as the other, and the second is the one that yields the pointer to pointer. (I prefer this term to "double pointer", which sounds more like a pointer to a double-precision floating-point number.)
With malloc you have the nice property that a null pointer can indicate an error, so you basically need just one return value. I am not sure whether this is possible with a pointer to device memory, as there might be no null value for it, or a different one (remember: this is CUDA and NOT ANSI C). It could be that the null pointer on the host system is entirely different from the null used on the device. In that case returning a null pointer to indicate an error would not work, and the API has to be designed this way (which would also mean that there is NO common NULL shared by host and device).
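To make the two designs concrete, here is a minimal host-only sketch in plain C. It does not call CUDA at all: `alloc_status`, `alloc_null_on_error` and `alloc_with_status` are hypothetical names made up for illustration, and ordinary `malloc` stands in for a device allocator. The second function only mirrors the shape of `cudaMalloc(void **devPtr, size_t size)`, which returns a status and writes the address through its first argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, standing in for cudaError_t. */
typedef enum { ALLOC_OK = 0, ALLOC_FAILED = 1 } alloc_status;

/* Style 1 (malloc-like): the pointer is the only return value,
 * and a null pointer doubles as the error signal. */
void *alloc_null_on_error(size_t size)
{
    return malloc(size);
}

/* Style 2 (cudaMalloc-like): the status is the return value, so the
 * address has to travel back through a pointer to pointer. */
alloc_status alloc_with_status(void **out_ptr, size_t size)
{
    if (out_ptr == NULL) {
        return ALLOC_FAILED;      /* nowhere to store the address */
    }
    *out_ptr = malloc(size);      /* malloc is only a stand-in here */
    return (*out_ptr != NULL) ? ALLOC_OK : ALLOC_FAILED;
}

/* Usage: the caller of style 2 tests the status code and never has to
 * know what an invalid address looks like. */
void usage_example(void)
{
    void *buf = NULL;
    alloc_status st = alloc_with_status(&buf, 64);
    assert(st == ALLOC_OK);
    free(buf);
}
```

With the second style the caller never compares the address against a null value, so the API works even if host and device do not agree on one.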