I know that NVIDIA GPUs with compute capability 2.x or greater can execute up to 16 kernels concurrently. However, my application spawns 7 "processes" and each of these 7 proc
Do you really need separate threads and contexts? I believe the best practice is to use one context per GPU, because multiple contexts on a single GPU bring significant overhead.
To execute many kernels concurrently, you should create a few CUDA streams within one CUDA context and queue each kernel into its own stream; they will then be executed concurrently, provided there are enough resources for it.
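A minimal sketch of that approach, assuming the runtime API and a placeholder kernel named `myKernel` (not from the original answer):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; in your application each stream would run real work.
__global__ void myKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

int main() {
    const int NUM_STREAMS = 7;   // matches the 7 workers from the question
    const int N = 1024;
    cudaStream_t streams[NUM_STREAMS];
    int *buffers[NUM_STREAMS];

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], N * sizeof(int));
    }

    // Kernels queued into different streams have no ordering dependency
    // between them, so on compute capability 2.x+ devices the hardware
    // may overlap their execution when resources allow.
    for (int i = 0; i < NUM_STREAMS; ++i)
        myKernel<<<(N + 255) / 256, 256, 0, streams[i]>>>(buffers[i], N);

    cudaDeviceSynchronize();

    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Note that concurrency is opportunistic: if one kernel already saturates the GPU's SMs, the others will effectively serialize even though they sit in separate streams.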
If you need to make the context accessible from several CPU threads, you can use cuCtxPopCurrent() and cuCtxPushCurrent() to pass it around, but only one thread will be able to work with the context at any given time.
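A rough sketch of the push/pop pattern with the driver API; the context creation and the `workOnContext` helper are illustrative assumptions, not part of the original answer:

```cuda
#include <cuda.h>

CUcontext ctx;  // shared context, created once by the main thread

void initOnce() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    // cuCtxCreate makes the context current here; detach it so
    // worker threads can claim it later.
    cuCtxPopCurrent(&ctx);
}

// Called from any CPU thread that needs the GPU. Serialize calls with
// your own mutex: only one thread may hold the context at a time.
void workOnContext() {
    cuCtxPushCurrent(ctx);   // make the shared context current on this thread
    // ... launch kernels, copy memory, etc. ...
    cuCtxPopCurrent(&ctx);   // release it for the next thread
}
```

The design trade-off is the same one the answer describes: a single context avoids per-context overhead, at the cost of explicitly handing it between threads.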