Is it possible to call cuBLAS or cuBLASLt functions from CUDA 10.1 kernels?

Asked by 随声附和, submitted 2021-01-21 09:46:51

Question


Concerning CUDA 10.1

I'm doing some calculations on geometric meshes, with a large number of independent calculations done per face of the mesh. I run a CUDA kernel that does the calculation for each face.

The calculations involve some matrix multiplication, so I'd like to use cuBLAS or cuBLASLt to speed things up. Since I need to do many matrix multiplications (at least a couple per face), I'd like to do them directly in the kernel. Is this possible?

It doesn't seem like cuBLAS or cuBLASLt allow you to call their functions from kernel (__global__) code. I get the following error from Visual Studio:

"calling a __host__ function from a __device__ function is not allowed"

There are some old answers (Could a CUDA kernel call a cublas function?) that imply this is possible, though.

Basically, I'd like a kernel like this:

__global__
void calcPerFace(...)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = index; i < faceCount; i += stride)
    {
        // Calculate some matrices for each face in the mesh
        ...
        // Multiply those matrices
        cublasLtMatmul(...) // <- not allowed by cuBLASLt
        // Continue calculation
        ...
    }
}

Is it possible to call cublasLtMatmul or perhaps cublasSgemm from a kernel like this in CUDA 10.1?


Answer 1:


It is not possible.

Starting with CUDA 10.0, CUDA no longer supports calling cuBLAS routines from device code.

A deprecation notice was given prior to CUDA 10.0, and the formal announcement exists in the CUDA 10.0 release notes:

The cuBLAS library, to support the ability to call the same cuBLAS APIs from within the device routines (cublas_device), is dropped starting with CUDA 10.0.

Likewise, CUDA sample codes that depended on this capability, such as simpleDevLibCUBLAS, are no longer part of the CUDA toolkit distribution, starting with CUDA 10.0.

This applies to cuBLAS only; it does not mean that the general capability of CUDA dynamic parallelism has been removed, as the sketch below illustrates.
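As a minimal illustration of that distinction (kernel names here are hypothetical, not from the question), a parent kernel launching a child kernel from device code still compiles and runs, provided the device is compute capability 3.5 or higher and the code is built with relocatable device code. Only the device-side cuBLAS entry points are gone.

// Minimal dynamic-parallelism sketch; kernel names are hypothetical.
// Build with: nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt
__global__ void childKernel(int faceIndex)
{
    // per-face work would go here
}

__global__ void parentKernel(int faceCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < faceCount)
        childKernel<<<1, 32>>>(i); // device-side launch: still supported
}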

I won't be able to respond to questions that ask "why?" (or variants of it), or that ask about future events or plans. There are no technical reasons that this functionality was unworkable or could not be supported; the reasons for the change had to do with development and resource priorities, and I won't be able to go deeper than that. If you would like to see a change in behavior for CUDA, whether in functionality, performance, or documentation, you are encouraged to express your desire by filing a bug at http://developer.nvidia.com.

For CUDA device code that performs some preparatory work, then calls cuBLAS, then performs some other work, the general suggestion is to break this into three stages: a kernel that performs the preparatory work, the desired cuBLAS routines launched from the host, and a subsequent kernel that performs the remaining work. This does not imply that data must be moved back and forth between device and host. Where multiple cuBLAS calls would have been made (e.g., one per device thread), it may be beneficial to investigate the various kinds of batched cuBLAS functionality that are available. It's not possible to give a single recipe for refactoring every kind of code, so these suggestions may not address every case.
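To make the three-stage suggestion concrete, here is a minimal sketch of the refactored flow. It assumes column-major, densely packed m-by-k and k-by-n matrices per face; prepareMatrices and finishCalculation are hypothetical stand-ins for the per-face work in the question. A single host-side call to cublasSgemmStridedBatched replaces the per-thread cublasLtMatmul calls, and all data stays on the device throughout.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical per-face kernels: fill A and B for each face, then
// consume the product C. Bodies elided; they stand in for the
// "calculate some matrices" / "continue calculation" steps.
__global__ void prepareMatrices(float* A, float* B, int faceCount) { /* ... */ }
__global__ void finishCalculation(const float* C, int faceCount) { /* ... */ }

void calcPerFaceRefactored(float* dA, float* dB, float* dC,
                           int faceCount, int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int threads = 256;
    int blocks = (faceCount + threads - 1) / threads;

    // Stage 1: per-face preparatory work in an ordinary kernel.
    prepareMatrices<<<blocks, threads>>>(dA, dB, faceCount);

    // Stage 2: all faceCount multiplications in one batched host-side call.
    // Matrices are assumed densely packed, so the stride between
    // consecutive matrices is one matrix's element count.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k, &alpha,
                              dA, m, (long long)m * k,
                              dB, k, (long long)k * n,
                              &beta,
                              dC, m, (long long)m * n,
                              faceCount);

    // Stage 3: the remaining per-face work in a follow-up kernel.
    finishCalculation<<<blocks, threads>>>(dC, faceCount);

    cublasDestroy(handle);
}

Because all three stages are issued on the same (default) stream, they execute in order on the device without any host-device copies.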



Source: https://stackoverflow.com/questions/57371249/is-it-possible-to-call-cublas-or-cublaslt-functions-from-cuda-10-1-kernels
