Simple CUDA Kernel Optimization

别那么骄傲 2020-12-29 17:01

In the process of speeding up an application, I have a very simple kernel which does the type casting as shown below:

__global__ void UChar2FloatKernel(float         


        
4 Answers
  •  暖寄归人
    2020-12-29 17:58

    You'd be better off writing a vectorized version of your code that writes a float4 into out at once. This should be pretty straightforward if nElem happens to be a multiple of 4; otherwise, you need to handle the remaining elements separately.
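
A rough sketch of what that could look like follows. The kernel signature in the question is cut off, so the parameter list (out, in, nElem) is only an assumption, and UChar2FloatKernelVec4 is a made-up name. It also assumes both buffers come straight from cudaMalloc, which makes them aligned well enough for the uchar4 loads and float4 stores. Whether this is actually faster depends on the hardware and the access pattern, so it is worth profiling.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical vectorized variant of the kernel from the question.
// ASSUMPTIONS: the parameter list (out, in, nElem) is guessed, because the
// original signature is truncated; `in` must be 4-byte aligned and `out`
// 16-byte aligned (pointers returned by cudaMalloc satisfy both).
__global__ void UChar2FloatKernelVec4(float *out, const unsigned char *in, int nElem)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int nVec = nElem / 4;  // number of complete groups of four elements

    if (i < nVec) {
        // One 4-byte load and one 16-byte store per thread.
        uchar4 v = reinterpret_cast<const uchar4 *>(in)[i];
        reinterpret_cast<float4 *>(out)[i] = make_float4(v.x, v.y, v.z, v.w);
    }

    // Residue: the last nElem % 4 elements are converted one at a time
    // by the first few threads of the grid.
    int r = 4 * nVec + i;
    if (r < nElem) {
        out[r] = static_cast<float>(in[r]);
    }
}

int main()
{
    const int nElem = 1003;  // deliberately not a multiple of 4
    std::vector<unsigned char> hIn(nElem);
    for (int k = 0; k < nElem; ++k) hIn[k] = static_cast<unsigned char>(k % 256);

    unsigned char *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIn, nElem * sizeof(unsigned char));
    cudaMalloc(&dOut, nElem * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), nElem * sizeof(unsigned char), cudaMemcpyHostToDevice);

    // One thread per group of four; keep at least one block so the
    // residue threads exist even when nElem < 4.
    const int threads = 256;
    const int nVec = nElem / 4;
    int blocks = (nVec + threads - 1) / threads;
    if (blocks < 1) blocks = 1;
    UChar2FloatKernelVec4<<<blocks, threads>>>(dOut, dIn, nElem);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy back (this also synchronizes) and compare against a host-side cast.
    std::vector<float> hOut(nElem);
    cudaMemcpy(hOut.data(), dOut, nElem * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int k = 0; k < nElem; ++k) {
        if (hOut[k] != static_cast<float>(hIn[k])) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches != 0;
}
```

If the pointers passed in are offsets into a larger buffer rather than fresh allocations, the alignment assumption may not hold, and the reinterpret_cast path would need a scalar prologue or a fallback to the plain kernel.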
