While speeding up an application, I ended up with a very simple kernel that does a type cast, as shown below:
__global__ void UChar2FloatKernel(float
You would be better off writing a vectorized version of your kernel that writes a float4 into out at once. This is pretty straightforward if nElem happens to be a multiple of 4; otherwise, you need to handle the residue (the last 1 to 3 elements) separately.
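A minimal sketch of what that could look like. The original kernel's parameter list is cut off in the question, so the signature below (out, in, nElem) is an assumption, and UChar2FloatVec4Kernel and UChar2FloatOnDevice are made-up names for illustration. It also assumes both pointers come straight from cudaMalloc, so the uchar4/float4 casts are properly aligned.

```cuda
#include <cuda_runtime.h>

// Assumed signature: the question's kernel is truncated, so the parameter
// names and order here are guesses, not the original code.
__global__ void UChar2FloatVec4Kernel(float* out, const unsigned char* in, int nElem)
{
    const int nVec   = nElem / 4;  // number of complete groups of 4 elements
    const int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = gridDim.x * blockDim.x;

    // Reinterpreting is only valid if `in` is 4-byte aligned and `out` is
    // 16-byte aligned (true for pointers returned by cudaMalloc).
    const uchar4* in4  = reinterpret_cast<const uchar4*>(in);
    float4*       out4 = reinterpret_cast<float4*>(out);

    // Vectorized part: one 4-byte load and one 16-byte store per iteration.
    for (int i = tid; i < nVec; i += stride) {
        const uchar4 v = in4[i];
        out4[i] = make_float4(static_cast<float>(v.x), static_cast<float>(v.y),
                              static_cast<float>(v.z), static_cast<float>(v.w));
    }

    // Residue: at most 3 trailing elements when nElem is not a multiple of 4.
    for (int i = 4 * nVec + tid; i < nElem; i += stride) {
        out[i] = static_cast<float>(in[i]);
    }
}

// Hypothetical host wrapper: copies the input to the device, runs the kernel,
// and copies the result back. Returns the first error encountered.
cudaError_t UChar2FloatOnDevice(const unsigned char* hIn, float* hOut, int nElem)
{
    unsigned char* dIn  = nullptr;
    float*         dOut = nullptr;
    cudaError_t err = cudaMalloc(&dIn, nElem * sizeof(unsigned char));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dOut, nElem * sizeof(float));
    if (err != cudaSuccess) { cudaFree(dIn); return err; }

    err = cudaMemcpy(dIn, hIn, nElem * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        const int nVec    = nElem / 4;
        // At least one block, so the residue loop runs even when nElem < 4.
        const int blocks  = nVec > 0 ? (nVec + threads - 1) / threads : 1;
        UChar2FloatVec4Kernel<<<blocks, threads>>>(dOut, dIn, nElem);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // cudaMemcpy synchronizes with the kernel on the default stream.
        err = cudaMemcpy(hOut, dOut, nElem * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

If in or out point into the middle of an allocation (for example, an offset that is not a multiple of 4 elements), the casts are no longer safe and you would have to peel off leading elements with the scalar path as well.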