GPU sync failed while using TensorFlow

Question

I'm trying to run this simple code to test TensorFlow:

    from __future__ import print_function

    import tensorflow as tf

    a = tf.constant(2)
    b = tf.constant(3)

    with tf.Session() as sess:
        print("a=2, b=3")
        print("Addition with constants: %i" % sess.run(a+b))

But, strangely, I am getting a "GPU sync failed" error.

Traceback:

runfile('D:/tf_examples-master/untitled3.py', wdir='D:/tf_examples-master')
a=2, b=3
Traceback (most recent call last):

  File "<ipython-input-5-d4753a508b93>", line 1, in <module>
    runfile('D:/tf_examples-master/untitled3.py', wdir='D:/tf_examples-master')

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/tf_examples-master/untitled3.py", line 15, in <module>
    print("Multiplication with constants: %i" % sess.run(a*b))

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)

InternalError: GPU sync failed

Any help will be appreciated.

Answer 1:

I got this "GPU sync failed" error too, and restarting my notebook/kernel did not help.

I had another notebook/kernel that had not been shut down and was still using my GPU. To fix the issue, all I did was shut down the other notebook and restart my current one, and everything worked!
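
To check whether some other process is still holding the GPU, something like the following may help. It is only a sketch: it assumes nvidia-smi is on your PATH and that the driver reports compute processes (on some Windows setups the memory column may show as N/A). The helper names are made up for illustration.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse the CSV printed by `nvidia-smi --query-compute-apps=...`."""
    apps = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue  # skip blank or unexpected lines
        apps.append({
            "pid": int(fields[0]),
            # Rejoin the middle fields in case the process path contains a comma.
            "name": ",".join(fields[1:-1]),
            "used_memory": fields[-1],
        })
    return apps


def list_gpu_processes():
    """Ask nvidia-smi which processes currently hold a compute context."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_compute_apps(out)
```

Calling list_gpu_processes() before starting a new session should show any leftover Python kernel; shutting that process down is what fixed it for me.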

Answer 2:

TL;DR: If TensorFlow throws a "GPU sync failed" error, it may be because the model's inputs are too large (as in my case when I first ran into this problem) or because cuDNN is not installed properly. Verify that cuDNN is installed correctly, reset your NVIDIA caches (i.e. sudo rm -rf $HOME/.nv/) if you have not yet done so after initially installing CUDA and cuDNN, and restart your machine.

While running an example from the TensorFlow (TF) docs (https://www.tensorflow.org/tutorials/keras/save_and_restore_models#checkpoint_callback_usage), I was getting the error

"GPU sync failed Error"

when running a tf.keras model with a large input (vectorized MNIST feature data, length = 28^2). Looking into the problem, I found this post (https://github.com/tensorflow/tensorflow/issues/5688), which attributes the problem specifically to large inputs to a model, and, following the chain of supposed causes, this one (https://github.com/tensorflow/tensorflow/issues/5688). The last line of the question in the second post shows this error message snippet:

F tensorflow/stream_executor/cuda/cuda_dnn.cc:2440] failed to enqueue convolution on stream: CUDNN_STATUS_NOT_SUPPORTED

From this, I decided to test whether cuDNN (which TF requires) was actually installed correctly (https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-deb). I followed the docs to verify the cuDNN install (https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#verify):

# Copy the cuDNN sample to a writable path.
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
# Go to the writable path.
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
# Compile the mnistCUDNN sample.
$ make clean && make
# Run the mnistCUDNN sample.
$ ./mnistCUDNN
# If cuDNN is properly installed and running on your Linux system, you will see a message similar to the following:
Test passed!

I found that the sample was throwing this error:

cudnnGetVersion() : 6021 , CUDNN_VERSION from cudnn.h : 6021 (6.0.21)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 20  Capabilities 6.1, SmClock 1797.0 Mhz, MemSize (Mb) 8107, MemClock 5005.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:394
Aborting...

Looking into this further, I found NVIDIA developer forum threads here (https://devtalk.nvidia.com/default/topic/1025900/cudnn/cudnn-fails-with-cudnn_status_internal_error-on-mnist-sample-execution/post/5259556/#5259556) and here (https://devtalk.nvidia.com/default/topic/1024761/cuda-setup-and-installation/cudnn_status_internal_error-when-using-cudnn7-0-with-cuda-8-0/post/5217666/#5217666), which recommend clearing the NVIDIA caches via

sudo rm -rf ~/.nv/

and restarting my machine (otherwise the installation verification tests for both CUDA and cuDNN will fail). After doing this, both the CUDA (https://docs.nvidia.com/cuda/archive/9.0/cuda-installation-guide-linux/index.html#install-samples) and cuDNN (https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-deb) installation checks passed.
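
If you would rather do the cache reset from Python, a rough equivalent of the rm command might look like the sketch below. The function name and the optional home argument are made up for illustration. Unlike the sudo version, it runs as the current user, so it will fail if the cache files are owned by root.

```python
import shutil
from pathlib import Path


def clear_nv_cache(home=None):
    """Delete the per-user NVIDIA cache directory (<home>/.nv) if it exists.

    Returns True if a directory was removed, False if there was nothing to do.
    """
    base = Path(home) if home is not None else Path.home()
    cache_dir = base / ".nv"
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(str(cache_dir))
    return True
```

A restart is still needed afterwards, as described above.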

I was finally able to run the TF model without error:

model.fit(train_images, train_labels,  
          epochs = 10, 
          validation_data = (test_images, test_labels),
          callbacks = [cp_callback])  # pass callback to training

Train on 1000 samples, validate on 1000 samples
Epoch 1/10
1000/1000 [==============================] - 1s 604us/step - loss: 1.1795 - acc: 0.6720 - val_loss: 0.7519 - val_acc: 0.7580

Epoch 00001: saving model to training_1/cp.ckpt
WARNING:tensorflow:This model was compiled with a Keras optimizer () but is being saved in TensorFlow format with save_weights. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.
.....

Hope this helps you.

Note: this may be an easy problem to run into. The TensorFlow docs explicitly require both CUDA and cuDNN to be installed for GPU support, but you can actually pip install tensorflow-gpu without installing cuDNN, even though this is not the correct thing to do. Someone who was too eager could therefore be misled into blaming their code rather than a missing installation requirement, which would be the real cause in this case.
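
As a quick sanity check before blaming your code, you can ask the cuDNN shared library for its version from Python. This is only a sketch: it assumes the library can be located under the name "cudnn" by the system loader (on Windows the DLL has a versioned name, so you would have to pass that name instead). It only tells you that the library loads, not that it works, so the mnistCUDNN sample above is still the real test.

```python
import ctypes
import ctypes.util


def cudnn_version(library_name="cudnn"):
    """Return cuDNN's version as an int, or None if the library cannot be loaded."""
    path = ctypes.util.find_library(library_name)
    if path is None:
        return None  # the loader does not know any library by that name
    try:
        lib = ctypes.CDLL(path)
        # cudnnGetVersion() takes no arguments and returns a size_t.
        lib.cudnnGetVersion.restype = ctypes.c_size_t
        return int(lib.cudnnGetVersion())
    except (OSError, AttributeError):
        return None
```

On my machine this would correspond to the 6021 that the mnistCUDNN sample printed for cudnnGetVersion(); a result of None suggests cuDNN is missing or not on the library path.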

Answer 3:

I had the same error

GPU sync failed

today, after my CNN had been running for about 12 hours.
Restarting the computer solved the problem temporarily.

Edited:

Today I had this error again. Instead of restarting the computer, I restarted the IPython console, and the error disappeared as well. It seems that, within the same Python environment, TensorFlow can no longer find an available GPU; once the Python environment is restarted, everything goes back to normal. I'm using tensorflow-gpu v1.10.0 and cuDNN v7.1.4 with a GTX 950M.
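
To see whether the current Python process can still find a GPU, a small check along these lines may be useful. It is a sketch that assumes TensorFlow 1.x, where tensorflow.python.client.device_lib is available; the two helper names are made up.

```python
def gpu_device_names(devices):
    """Return the names of all devices whose type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def visible_gpus():
    """List the GPUs TensorFlow can see in the current Python process."""
    # Imported lazily so that gpu_device_names() works without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

If visible_gpus() returns an empty list in the console where the error occurred, but a non-empty one after restarting the console, that would match what I observed.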

Answer 4:

This is an older question, but for those who come across it, my fix was different from the other answers.

The code used import schedule to run a TensorFlow model at scheduled times. It would run the first time without issue, then on the second run it would return a

GPU sync failed

error. Previously, I had fixed a memory issue by using from numba import cuda to release the memory allocated by TensorFlow. That code included the line cuda.close(), as I thought TensorFlow would reopen a CUDA session on the next run. I removed the cuda.close() line and everything has been working well ever since.
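
This is not what I did, but another way to get GPU memory released between scheduled runs without calling cuda.close() is to run each job in its own short-lived Python process, since the driver frees a process's GPU memory when it exits. A minimal sketch, with a made-up helper name:

```python
import subprocess
import sys


def run_in_fresh_process(script_source):
    """Run Python source in a new interpreter and return its stdout.

    Whatever the child allocates (including GPU memory) is released
    when the child exits, so the parent never has to close CUDA itself.
    """
    result = subprocess.run(
        [sys.executable, "-c", script_source],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if the job fails
    )
    return result.stdout
```

The scheduled function would then call run_in_fresh_process() with the model-running code (or launch a separate script the same way) instead of importing TensorFlow in the long-lived scheduler process.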

Source: https://stackoverflow.com/questions/51112126/gpu-sync-failed-while-using-tensorflow

Tags

python

python-3.x

tensorflow