Slow first cudaMalloc (K40 vs K20), even after cudaSetDevice

情话喂你 2020-12-07 05:39

I understand that CUDA performs its initialization during the first API call, but the time spent is far too long, even when a separate cudaSetDevice call is made first.

The test program:

2 answers
  •  甜味超标
    2020-12-07 06:13

    Depending on the architecture the code was compiled for and the architecture of the GPU that actually runs it, JIT compilation can kick in on the first cudaMalloc call (or on any other first CUDA call). "If binary code is not found but PTX is available, then the driver compiles the PTX code." If the binary contains no machine code matching the GPU, that compilation time is added to the first call. Some more details:

    http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
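
    To check where the time actually goes, the calls can be timed separately. The following is only a minimal sketch, not the asker's test program: the helper names `time_ms` and `check`, the device index 0, and the 1 MiB allocation size are all assumptions chosen for illustration. It uses `cudaFree(nullptr)` as a common way to force context creation before the first real allocation.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: run a callable and return the elapsed wall time in ms.
template <typename F>
static double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    void* p = nullptr;

    // Device 0 is an assumption; pick the index of the GPU under test.
    double t_set = time_ms([] { check(cudaSetDevice(0), "cudaSetDevice"); });

    // cudaFree(nullptr) does no real work, so its cost is mostly
    // initialization that had not yet happened after cudaSetDevice.
    double t_init = time_ms([] { check(cudaFree(nullptr), "cudaFree(nullptr)"); });

    // With initialization already paid for, this measures the allocation itself.
    double t_malloc = time_ms([&] { check(cudaMalloc(&p, 1 << 20), "cudaMalloc"); });

    std::printf("cudaSetDevice:     %.3f ms\n", t_set);
    std::printf("cudaFree(nullptr): %.3f ms\n", t_init);
    std::printf("first cudaMalloc:  %.3f ms\n", t_malloc);

    check(cudaFree(p), "cudaFree");
    return 0;
}
```

    If JIT compilation is the suspect, one way to test it is to build with machine code for the GPU's own architecture, for example `nvcc -gencode arch=compute_35,code=sm_35 timing.cu` (the file name is made up, and compute capability 3.5 is assumed here for the K20/K40; substitute the value for your GPU), and compare the timings with a PTX-only build.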
