I am having some trouble with concurrent CUDA. Take a look at the attached image. Th
This is the expected behavior on Windows with the WDDM driver model, where the driver batches kernel launches to reduce launch overhead. Try inserting cudaStreamQuery(0)
immediately after the kernel invocation, so that the queued kernel is submitted to the GPU right away instead of waiting until the batch is full.
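
As a rough sketch of what that looks like in practice (the `scale` kernel, buffer size, and launch configuration are made up for illustration and are not taken from the question), the query goes directly after the launch. `cudaStreamQuery` does not block; it returns `cudaErrorNotReady` while the kernel is still running, which is not a failure here:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to launch.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 20;  // illustrative buffer size
    float *d_data = nullptr;

    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Asynchronous launch on the default stream. Under WDDM the driver
    // may keep this in its batch instead of submitting it immediately.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // Non-blocking query on the default stream (stream 0), placed right
    // after the launch as suggested above to prompt early submission.
    cudaError_t status = cudaStreamQuery(0);
    if (status != cudaSuccess && status != cudaErrorNotReady) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        cudaFree(d_data);
        return 1;
    }

    // Host-side work placed here can overlap with the running kernel.

    // Wait for the kernel to finish before releasing the buffer.
    status = cudaDeviceSynchronize();
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
    }

    cudaFree(d_data);
    return status == cudaSuccess ? 0 : 1;
}
```

If the kernels are launched into non-default streams, the same idea would presumably apply with that stream's handle passed to `cudaStreamQuery` instead of 0.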