Slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice


情话喂你 2020-12-07 05:39

I understand that CUDA performs its initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice call.

The Test program:

2 answers

甜味超标 (original poster)

2020-12-07 06:13

Depending on the architecture the code was compiled for and the architecture of the GPU that runs it, JIT compilation can kick in with the first cudaMalloc (or any other) call. "If binary code is not found but PTX is available, then the driver compiles the PTX code." In that case the compile time is charged to whichever runtime call happens to trigger initialization first, which is why the first cudaMalloc looks slow. Some more details:

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
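One way to see where the time goes is to time the first few runtime calls separately. The following is only a minimal sketch, not the original poster's test program: the helper names `time_ms` and `ok`, the 1 MiB allocation size, and the use of device 0 are assumptions made for illustration. The build line in the leading comment assumes a toolkit that still supports `sm_35` (the compute capability of both the K20 and the K40); embedding binary code for that architecture should keep the driver from having to JIT-compile PTX.

```cuda
// Build (assumes a toolkit that still supports sm_35):
//   nvcc -std=c++11 -gencode arch=compute_35,code=sm_35 -o first_call first_call.cu
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of one callable, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: report a failed runtime call and return false.
bool ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    cudaError_t err = cudaSuccess;
    void* a = nullptr;
    void* b = nullptr;

    // Select the device (device 0 is an assumption).
    double t_set = time_ms([&] { err = cudaSetDevice(0); });
    if (!ok(err, "cudaSetDevice")) return 1;

    // cudaFree(0) is a common idiom for forcing context creation,
    // so the initialization cost is separated from the allocation itself.
    double t_init = time_ms([&] { err = cudaFree(0); });
    if (!ok(err, "cudaFree(0)")) return 1;

    // First and second allocations of 1 MiB each (size is arbitrary).
    double t_first = time_ms([&] { err = cudaMalloc(&a, 1 << 20); });
    if (!ok(err, "first cudaMalloc")) return 1;
    double t_second = time_ms([&] { err = cudaMalloc(&b, 1 << 20); });
    if (!ok(err, "second cudaMalloc")) return 1;

    std::printf("cudaSetDevice:     %10.3f ms\n", t_set);
    std::printf("cudaFree(0):       %10.3f ms\n", t_init);
    std::printf("first cudaMalloc:  %10.3f ms\n", t_first);
    std::printf("second cudaMalloc: %10.3f ms\n", t_second);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If most of the time lands in the `cudaFree(0)` line rather than in the first `cudaMalloc`, the cost is one-time initialization (possibly including JIT compilation), not the allocation. Comparing a build with and without the matching `-gencode` entry on each card is one way to check whether JIT compilation is the cause.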
