Does the CUDA C++ math function exp have overloaded versions by default?

Submitted by 依然范特西╮ on 2020-11-24 20:05:58

Question


The problem comes from the documentation, where I found two functions, exp and expf. It says that exp means double exp(double) and expf means float expf(float). I wonder whether exp has overloaded versions by default, such as float exp(float) or fp16 exp(fp16), or must I use different functions when the inputs have different types?

Consider a scenario where I use template:

template <typename T>
T compute (T in) {return exp(in);}

If there is no default float exp(float), I cannot use compute<float>(1.f) to call this template function. I know that I can call the function that way, but I do not know how the compiler deals with it. When I call exp(1.f), does the compiler first cast the input to double and then cast the return value back to float, or does it use the float value as input directly?


Answer 1:


It is said that exp means double exp(double) and expf means float expf(float). I wonder whether exp has overloaded versions by default, such as float exp(float) ...

Yes, the CUDA compiler does what a normal C++ compiler does and will transparently select the correctly overloaded version of the function for the argument type. This works for float and double ...

... or fp16 exp(fp16).

... but it does not presently work for half precision floating point.

As an example, this:

$ cat overlay.cu
#include <cuda_fp16.h>

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];
}

template __global__ void kernel<float>(const float*, const float*, float*, int);
template __global__ void kernel<double>(const double*, const double*, double*, int);

will compile correctly:

$ nvcc -arch=sm_70 -Xptxas="-v" -c overlay.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIdEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIdEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 380 bytes cmem[0], 88 bytes cmem[2]
ptxas info    : Compiling entry function '_Z6kernelIfEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 380 bytes cmem[0]

but adding

template __global__ void kernel<__half>(const __half*, const __half*, __half*, int);

will fail:

$ nvcc -arch=sm_70 -Xptxas="-v" -c overlay.cu
overlay.cu(9): error: more than one instance of overloaded function "exp" matches the argument list:
            function "std::exp(long double)"
            function "std::exp(float)"
            argument types are: (const __half)
          detected during instantiation of "void kernel(const T *, const T *, T *, int) [with T=__half]"

As pointed out in comments, C++14/C++17 don't define a standardized half precision type or standard library, so this error is pretty much in line with expected behaviour.

If you want a half precision version, then I suggest using an explicit template specialization for the fp16 case, which exploits the (most performant) intrinsic for the type, for example:

#include <cuda_fp16.h>

template<typename T>
__global__ void kernel(const T* x, const T* y, T* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = exp(x[tid]) * y[tid];
}

template __global__ void kernel<float>(const float*, const float*, float*, int);
template __global__ void kernel<double>(const double*, const double*, double*, int);

template<> __global__ void kernel(const __half* x, const __half* y, __half* output, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N)
        output[tid] = hexp(x[tid]) * y[tid];
}

This is probably the best implementation at this stage, and it compiles as expected:

$ nvcc -std=c++11 -arch=sm_70 -Xptxas="-v" -c overlay.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelI6__halfEvPKT_S3_PS1_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelI6__halfEvPKT_S3_PS1_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 380 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelIdEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIdEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 380 bytes cmem[0], 88 bytes cmem[2]
ptxas info    : Compiling entry function '_Z6kernelIfEvPKT_S2_PS0_i' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvPKT_S2_PS0_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 380 bytes cmem[0]

[Answer assembled from comments with own editorialisation added to get question off unanswered list for the CUDA tag. Please edit/improve as you see fit]



Source: https://stackoverflow.com/questions/63065825/does-cuda-c-math-function-of-exp-have-override-functions-by-default
