Can anyone describe the differences between __global__ and __device__?
When should I use __device__, and when to use __glob
I am recording some unfounded speculations here for the time being (I will substantiate these later when I come across some authoritative source)...
__device__ functions can have a return type other than void but __global__ functions must always return void.
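To make the return-type point concrete, here is a minimal sketch (not taken from any authoritative source). The names square, square_all and run_square_kernel are made up for illustration. The __device__ function returns an int and is called like an ordinary function from GPU code, while the __global__ kernel returns void and is launched from the host with the <<<...>>> syntax.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// __device__: callable only from code already running on the GPU.
// It may return a value, like a normal C++ function.
__device__ int square(int x) {
    return x * x;
}

// __global__: a kernel. It is launched from the host with <<<...>>>
// and must return void; results go back through device memory.
__global__ void square_all(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = square(in[i]);  // ordinary call, executed by this same thread
    }
}

// Hypothetical host helper: squares 0..n-1 on the GPU (assumes n > 0).
// Error checking is omitted to keep the sketch short.
std::vector<int> run_square_kernel(int n) {
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    square_all<<<blocks, threads>>>(d_in, d_out, n);  // kernel launch from the host

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host;
}

int main() {
    std::vector<int> result = run_square_kernel(8);
    for (int v : result) printf("%d ", v);
    printf("\n");
    return 0;
}
```

Trying to call square() directly from main(), or giving square_all a non-void return type, should be rejected by nvcc. That is an easy way to check both rules for yourself.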
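__global__ functions can also be launched (with the <<<...>>> syntax, not called like ordinary functions) from within other kernels running on the GPU to start additional GPU threads, as part of the CUDA dynamic parallelism model (aka CNP). This needs a device of compute capability 3.5 or higher. A __device__ function, by contrast, runs in the same thread as the code that calls it and starts no new threads.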
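A small sketch of that difference, assuming a GPU and toolkit that support dynamic parallelism. The names child, parent, twice and run_parent are hypothetical. The parent kernel calls a __device__ function in its own thread and then launches a second kernel, which creates new threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: each launch creates new GPU threads.
__global__ void child(int parent_thread) {
    printf("child thread %d, launched by parent thread %d\n",
           threadIdx.x, parent_thread);
}

// __device__ helper: runs in whichever thread calls it.
__device__ int twice(int x) {
    return 2 * x;
}

__global__ void parent() {
    // Plain function call: no new threads, executes in this thread.
    int doubled = twice(threadIdx.x);
    printf("parent thread %d computed %d\n", threadIdx.x, doubled);

    // Kernel launch from device code (dynamic parallelism):
    // thread 0 of the parent starts a grid of 4 new threads.
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(threadIdx.x);
    }
}

// Hypothetical host helper: launches the parent kernel and returns 0 on success.
int run_parent() {
    parent<<<1, 2>>>();
    // The parent grid is not complete until its child grids have finished.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() {
    return run_parent();
}
```

Device-side launches need relocatable device code, so this would be built with something like nvcc -rdc=true -arch=sm_70 file.cu -lcudadevrt, where the -arch value is only an example and should match your GPU.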