I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.
I compiled my kernel.cu file to kernel.o.
__global__ and __device__ functions? Yes, correct.

__constant__ variables and kernel arguments: different "banks" are used. That starts to get a bit detailed, but as long as you use less than 64 KB for your __constant__ variables and less than 4 KB for kernel arguments, you will be OK.
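To make the limits above concrete, here is a minimal sketch that asks the CUDA runtime for a kernel's per-thread register count and constant memory footprint via `cudaFuncGetAttributes`. The kernel `myKernel`, the helper `scale`, and the table `coeffs` are hypothetical names made up for illustration; they are not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical __constant__ table: 256 floats = 1 KB, far below the 64 KB limit.
__constant__ float coeffs[256];

// Hypothetical __device__ helper. If the compiler inlines it, the registers it
// needs should show up in the calling kernel's register count.
__device__ float scale(float x, int i)
{
    return x * coeffs[i % 256];
}

// Hypothetical kernel. Its arguments (two pointers and an int) take 20 bytes
// plus padding, far below the 4 KB argument limit.
__global__ void myKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale(in[i], i);
    }
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Values depend on the target architecture and compiler version.
    printf("registers per thread      : %d\n", attr.numRegs);
    printf("constant memory (bytes)   : %zu\n", attr.constSizeBytes);
    printf("static shared mem (bytes) : %zu\n", attr.sharedSizeBytes);
    printf("local mem/thread (bytes)  : %zu\n", attr.localSizeBytes);
    return 0;
}
```

The same kind of figures can be printed at compile time with `nvcc --resource-usage` (or `nvcc -Xptxas -v`), which reports constant memory per bank as `cmem[N]` entries. The exact numbers will differ between GPU architectures and toolkit versions, so treat any output as specific to your build.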