How to generate, compile and run CUDA kernels at runtime


Question


Well, I have quite a delicate question :)

Let's start with what I have:

  1. Data: a large array of data, copied to the GPU
  2. A program, generated by the CPU (host), which needs to be evaluated for every element of that array
  3. The program changes very frequently; it can be generated as a CUDA string, a PTX string or something else (?) and needs to be re-evaluated after each change

What I want: Basically, I just want to make this as effective (fast) as possible, e.g. avoid the compilation of CUDA source to PTX at runtime. The solution can even be completely device-specific; no great compatibility is required here :)

What I know: I already know the function cuModuleLoad, which can load a module and create a kernel from PTX code stored in a file. But I think there must be some other way to create a kernel directly, without saving it to a file first. Or perhaps it may be possible to store it as bytecode?

My question: How would you do that? Could you post an example or a link to a website covering a similar topic? TY

Edit: OK, a PTX kernel can now be run from a PTX string (char array) directly. Still, I wonder: is there some better/faster solution? There is still a conversion from the string to some PTX bytecode, which should be avoided if possible. I also suspect that some clever way of creating a device-specific CUDA binary from PTX might exist, which would remove the JIT compiler lag (it is small, but it can add up if you have a huge number of kernels to run) :)
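
A possible route to the device-specific binary mentioned above: PTX can be pre-compiled offline into a cubin with ptxas, and the driver loads a cubin without invoking its JIT compiler at all. A hedged sketch (the sm_86 architecture is an assumption; it must match the actual GPU or the load will fail):

// Pre-compile PTX to a device-specific cubin; loading a cubin later
// skips the driver's JIT step entirely. The -arch value is an assumption.
int ptxas_status = system("ptxas -arch=sm_86 kernel.ptx -o kernel.cubin");

// kernel.cubin can then be loaded with cuModuleLoad (from file) or
// cuModuleLoadData (from memory) exactly like PTX.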


Answer 1:


In his comment, Roger Dahl has linked the following post

Passing the PTX program to the CUDA driver directly

in which the use of two functions, namely cuModuleLoad and cuModuleLoadDataEx, is addressed. The former loads PTX code from a file and hands it to the driver's JIT compiler. The latter avoids the file I/O and lets you pass the PTX code to the driver as a C string. In either case, you need to already have the PTX code at your disposal, either as the result of compiling a CUDA kernel (to be loaded from file or copied and pasted into the C string) or as a hand-written source.
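
For the second option, here is a minimal sketch of loading PTX held in memory (it assumes a CUDA context has already been created, ptx_string is a NUL-terminated PTX image, and the entry point is named kernel):

#include <cuda.h>
#include <cassert>

CUmodule cuModule;
CUfunction cuFunction;

// The last three arguments are the JIT option count, keys and values;
// none are set here, which is equivalent to calling cuModuleLoadData.
CUresult result = cuModuleLoadDataEx(&cuModule, ptx_string, 0, nullptr, nullptr);
assert(result == CUDA_SUCCESS);

result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
assert(result == CUDA_SUCCESS);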

But what happens if you have to create the PTX code on the fly, starting from a CUDA kernel? Following the approach in CUDA Expression templates, you can define a string containing your CUDA kernel like this:

ss << "extern \"C\" __global__ void kernel( ";
ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
ss << "\tfor(unsigned int i = 0; i < ";
ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
ss << "\t\tif(idx < vector_size) { \n";
ss << "\t\t\t" << eval_line.str() << "\n";
ss << "\t\t\tidx += number_of_used_threads;\n";
ss << "\t\t}\n";
ss << "\t}\n";
ss << "}\n\n\n\n";

and then compiling it to PTX with a system call to nvcc:

// NVCC and NVCC_FLAGS are assumed to be defined elsewhere, e.g. as macros
// holding the compiler path and its options.
int nvcc_exit_status = system(
    (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename
        + " -o " + kernel_comp_filename).c_str());

if (nvcc_exit_status) {
    std::cerr << "ERROR: nvcc exited with status code: " << nvcc_exit_status << std::endl;
    exit(1);
}

and finally using cuModuleLoad and cuModuleGetFunction to load the compiled PTX from file and obtain a handle to the kernel:

// cuModule, cuFunction and result are of type CUmodule, CUfunction and CUresult.
result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
assert(result == CUDA_SUCCESS);
result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
assert(result == CUDA_SUCCESS);
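
Once cuFunction is available, the kernel can be launched through the driver API with cuLaunchKernel. A hedged sketch: the grid and block sizes and the argument list depend on what def_line actually declares, so d_data, vector_size, num_threads, blocks and threads below are placeholders:

// Each entry of args points at the value of the corresponding kernel argument;
// d_data (a CUdeviceptr) stands in for whatever parameters def_line generated.
void* args[] = { &d_data, &vector_size, &num_threads };

result = cuLaunchKernel(cuFunction,
                        blocks, 1, 1,   // grid dimensions
                        threads, 1, 1,  // block dimensions
                        0,              // dynamic shared memory (bytes)
                        0,              // stream (0 = default stream)
                        args,           // kernel arguments
                        nullptr);       // extra options
assert(result == CUDA_SUCCESS);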

Of course, expression templates have nothing to do with this problem; I am only quoting the source of the ideas reported in this answer.



Source: https://stackoverflow.com/questions/19838440/how-to-generate-compile-and-run-cuda-kernels-at-runtime
