Is it unsafe to run multiple tensorflow processes on the same GPU?

Submitted by 喜欢而已 on 2019-12-07 01:13:55

Question


I only have one GPU (Titan X Pascal, 12 GB VRAM) and I would like to train multiple models, in parallel, on the same GPU.

I tried encapsulating my model in a single Python program (called model.py), and I included code in model.py to restrict VRAM usage (based on this example). I was able to run up to 3 instances of model.py concurrently on my GPU (with each instance taking a little less than 33% of my VRAM). Mysteriously, when I tried with 4 models I received an error:

2017-09-10 13:27:43.714908: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-09-10 13:27:43.714973: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-09-10 13:27:43.714988: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Aborted (core dumped)

I later observed on the tensorflow Github that people seem to think that it is unsafe to have more than one tensorflow process running per GPU. Is this true, and is there an explanation for why this is the case? Why was I able to have 3 tensorflow processes running on the same GPU and not 4?
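For context, a per-process VRAM cap like the one described in the question can be set up roughly as follows. This is a sketch against the TF 1.x API in use at the time; the fraction value is illustrative, not the exact figure from the linked example:

```python
import tensorflow as tf  # TF 1.x API, as used in this question

# Cap this process at roughly one third of the GPU's memory so that up to
# three instances of model.py can share the 12 GB card.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... build and train the model here ...
    pass
```

An alternative is `gpu_options.allow_growth = True`, which makes TensorFlow allocate GPU memory lazily instead of reserving a fixed fraction up front.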


Answer 1:


In short: yes, it is safe to run multiple processes on the same GPU (as of May 2017). It was previously unsafe to do so.

Link to tensorflow source code that confirms this




Answer 2:


Whether it works depends on the available video memory.

In my case, the GPU has 2 GB of video memory in total, and a single instance reserves about 1.4 GB. The log below shows what happened when I tried to run a second TensorFlow script while the speech recognition training was already running:

2018-08-28 08:52:51.279676: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.2415
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2018-08-28 08:52:51.294948: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 08:52:55.643813: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 08:52:55.647912: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971]      0
2018-08-28 08:52:55.651054: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0:   N
2018-08-28 08:52:55.656853: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1409 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)

The speech recognition training then failed with the following error, which terminated the script completely (I believe this is caused by running out of video memory):

2018-08-28 08:53:05.154711: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:1108] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED ::
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1278, in _do_call
    return fn(*args)
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
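A rough memory-budget check, using the numbers reported in the log above, shows why the second process could not fit:

```python
# Illustrative arithmetic based on the figures in the log above.
total_vram_gb = 2.00     # GeForce 940MX total memory
free_vram_gb = 1.65      # free memory at startup, per the log
per_instance_gb = 1.4    # approximate reservation of one training instance

# One instance fits within the free memory...
assert per_instance_gb <= free_vram_gb

# ...but two instances need ~2.8 GB, more than the card has in total,
# so the second process triggers allocation and kernel-launch failures.
instances = 2
required_gb = instances * per_instance_gb
print(required_gb > total_vram_gb)  # → True
```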


Source: https://stackoverflow.com/questions/46145100/is-it-unsafe-to-run-multiple-tensorflow-processes-on-the-same-gpu
