Linking with 3rd-party CUDA libraries slows down cudaMalloc
It is no secret that on CUDA 4.x the first call to cudaMalloc can be ridiculously slow (this has been reported several times), seemingly because of a bug in the CUDA drivers.

Recently, I noticed some weird behaviour: the running time of cudaMalloc depends directly on how many 3rd-party CUDA libraries are linked into my program. Note that I do NOT use these libraries, I only link my program against them.

I ran some tests using the following program:

    #include <cuda_runtime.h>

    int main() {
        cudaSetDevice(0);
        unsigned int *ptr = 0;
        // First CUDA allocation in the process: this is the call being timed.
        cudaMalloc((void **)&ptr, 2000000 * sizeof(unsigned int));
        cudaFree(ptr);
        return 0;
    }

The results are as follows:

Linked