TL;DR: CUDA lazy initialization (as @RobertCrovella suggests).
@RobertCrovella explains in the dupe bug:
CUDA initialization usually includes establishment of UVM, which involves harmonizing of device and host memory maps. If your server has more system memory than your PC, it is one possible explanation for the disparity in initialization time. The OS may have an effect as well, finally the memory size of the GPU may have an effect.
The machine on which I get this behavior has 256 GB of memory, 32 times as much as my home machine, and the GPU itself has 12 GB, 4 times as much as the GPU in my home machine. So, unfortunately, I can expect initialization of the CUDA driver and/or runtime API to take much longer there than at home. Some or all of this initialization is performed lazily, and in my case it happens to be triggered by the call to cudaGetCacheConfig(); I suppose the other calls require only part of the initialization (though it is not clear why).