Smart design for large kernel with different inputs that only changes one line of code
问题 I am designing some kernels that I would like to have 2 ways of calling: Once with standard float * device as input (for writing), and another with cudaSurfaceObject_t as input (for writing). The kernel itself is long (>200 lines) and ultimately, I only need the last line to be different. In one case you have standard out[idx]=val type of assignment, while in the other one a surf3Dwrite() type. The rest of the kernel is identical. Something like __global__ kernel(float * out , ....) { // 200