Compiling CUDA with dynamic parallelism fallback - multiple architectures/compute capability

Submitted by 柔情痞子 on 2019-12-08 11:01:47

Question


In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. To provide a fallback when dynamic parallelism is not supported, or simply to let the application continue with reduced/partially available features, how should I go about compiling?

At the moment I'm getting "invalid device function" when running kernels compiled with -arch=sm_35 on a GTX 670 (max sm_30), even for kernels that don't require compute 3.5.

AFAIK you can't pass multiple -arch=sm_* arguments, and using multiple -gencode=* options doesn't help. Also, for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when targeting compute 3.0 (nvlink fatal : no candidate found in fatbinary, caused by -lcudadevrt, which I've needed for 3.5). How should I deal with this?
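
For reference, a minimal sketch of the separate-compilation flow described above (the file names, and kernels.cu holding both the kernels and main, are illustrative assumptions):

$ nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o          # compile with relocatable device code
$ nvcc -arch=sm_35 -dlink kernels.o -o device_link.o -lcudadevrt # device-link step; needs cudadevrt for DP
$ g++ kernels.o device_link.o -o app -L/usr/local/cuda/lib64 -lcudart -lcudadevrt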


Answer 1:


I believe this issue has been addressed now in CUDA 6.

Here's my simple test:

$ cat t264.cu
#include <stdio.h>

__global__ void kernel1(){
  printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){

#if __CUDA_ARCH__ >= 350
  // device-side (dynamic parallelism) launch, compiled only for cc >= 3.5
  kernel1<<<1,1>>>();
#else
  // fallback path for architectures without dynamic parallelism
  printf("Hello from non-DP Kernel\n");
#endif
}

int main(){

  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();  // wait so the device-side printf output appears
  return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$

In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.




Answer 2:


I don't believe there is a way to do this using the runtime API as of CUDA 5.5.

The only way I can think of to get around the problem is to use the driver API to perform your own architecture selection and load code from different cubin files at runtime. The two APIs can be safely mixed, so only the context-establishment, device-selection, and module-load phase needs to be done with the driver API. You can use the runtime API after that; you will need a little homemade syntactic sugar for the kernel launches, but otherwise no code changes are required in the rest of your runtime API code.
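
As a rough sketch of that approach (this is not code from the original answer: the cubin file names and the extern "C" kernel name "mykernel" are illustrative assumptions), the selection-and-load phase might look like this:

#include <cuda.h>   // CUDA driver API; link with -lcuda

int main(){
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);

  // query this device's compute capability
  int major = 0, minor = 0;
  cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
  cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);

  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev);

  // choose a cubin built for this architecture; the sm_35 one must be
  // device-linked against cudadevrt if it uses dynamic parallelism
  const char *cubin = (major > 3 || (major == 3 && minor >= 5))
                    ? "kernels_sm35.cubin" : "kernels_sm30.cubin";
  CUmodule mod;
  cuModuleLoad(&mod, cubin);

  // look up an extern "C" kernel by name and launch it (<<<1,1>>> equivalent)
  CUfunction fn;
  cuModuleGetFunction(&fn, mod, "mykernel");
  cuLaunchKernel(fn, 1,1,1, 1,1,1, 0, NULL, NULL, NULL);
  cuCtxSynchronize();

  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}

Once the module is loaded, runtime API calls such as cudaMalloc and cudaMemcpy operate in the same context, so the rest of the runtime API code can stay unchanged.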



Source: https://stackoverflow.com/questions/20896136/compiling-cuda-with-dynamic-parallelism-fallback-multiple-architectures-comput
