Failed to get convolution algorithm. This is probably because cuDNN failed to initialize,

Submitted by 安稳与你 on 2019-11-27 23:49:37

Question


In TensorFlow/Keras, when running the code from https://github.com/pierluigiferrari/ssd_keras with the evaluator ssd300_evaluation, I received this error:

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

This is very similar to the unsolved question: Google Colab Error: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

The setup I'm running with this issue:

  • Python: 3.6.4
  • TensorFlow: 1.12.0
  • Keras: 2.2.4
  • CUDA: V10.0
  • cuDNN: V7.4.1.5
  • GPU: NVIDIA GeForce GTX 1080

I also ran:

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))  # runs the matmul on the GPU and prints the result

With no errors or issues.
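Note that a matrix multiply is handled by cuBLAS rather than cuDNN, so the test above passing does not rule out a cuDNN problem. A convolution-specific sanity check would look something like this (a minimal sketch against the TF 1.x API; the shapes are arbitrary):

    import numpy as np
    import tensorflow as tf

    # A random NHWC batch and a small filter; conv2d on the GPU goes through cuDNN.
    images = tf.constant(np.random.rand(1, 32, 32, 3), dtype=tf.float32)
    filters = tf.constant(np.random.rand(3, 3, 3, 8), dtype=tf.float32)
    conv = tf.nn.conv2d(images, filters, strides=[1, 1, 1, 1], padding='SAME')

    with tf.Session() as sess:
        print(sess.run(conv).shape)  # expect (1, 32, 32, 8) if cuDNN initializes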

The minimal example is:

 from keras import backend as K
 from keras.models import load_model
 from keras.optimizers import Adam
 from scipy.misc import imread
 import numpy as np
 from matplotlib import pyplot as plt

 from models.keras_ssd300 import ssd_300
 from keras_loss_function.keras_ssd_loss import SSDLoss
 from keras_layers.keras_layer_AnchorBoxes import AnchorBoxes
 from keras_layers.keras_layer_DecodeDetections import DecodeDetections
 from keras_layers.keras_layer_DecodeDetectionsFast import DecodeDetectionsFast
 from keras_layers.keras_layer_L2Normalization import L2Normalization
 from data_generator.object_detection_2d_data_generator import DataGenerator
 from eval_utils.average_precision_evaluator import Evaluator
 import tensorflow as tf
 %matplotlib inline
 import keras
 keras.__version__



 # Set a few configuration parameters.
 img_height = 300
 img_width = 300
 n_classes = 20
 model_mode = 'inference'


 K.clear_session() # Clear previous models from memory.

 model = ssd_300(image_size=(img_height, img_width, 3),
            n_classes=n_classes,
            mode=model_mode,
            l2_regularization=0.0005,
            scales=[0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05], # the scales for MS COCO are [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]
            aspect_ratios_per_layer=[[1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5]],
            two_boxes_for_ar1=True,
            steps=[8, 16, 32, 64, 100, 300],
            offsets=[0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
            clip_boxes=False,
            variances=[0.1, 0.1, 0.2, 0.2],
            normalize_coords=True,
            subtract_mean=[123, 117, 104],
            swap_channels=[2, 1, 0],
            confidence_thresh=0.01,
            iou_threshold=0.45,
            top_k=200,
            nms_max_output_size=400)

 # 2: Load the trained weights into the model.

 # TODO: Set the path of the trained weights.
 weights_path = 'C:/Users/USAgData/TF SSD Keras/weights/VGG_VOC0712Plus_SSD_300x300_iter_240000.h5'

 model.load_weights(weights_path, by_name=True)

 # 3: Compile the model so that Keras won't complain the next time you load it.

 adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

 ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)

 model.compile(optimizer=adam, loss=ssd_loss.compute_loss)


dataset = DataGenerator()

# TODO: Set the paths to the dataset here.
dir= "C:/Users/USAgData/TF SSD Keras/VOC/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/"
Pascal_VOC_dataset_images_dir = dir+ 'JPEGImages'
Pascal_VOC_dataset_annotations_dir = dir + 'Annotations/'
Pascal_VOC_dataset_image_set_filename = dir+'ImageSets/Main/test.txt'

# The XML parser needs to know what object class names to look for and in which order to map them to integers.
classes = ['background',
           'aeroplane', 'bicycle', 'bird', 'boat',
           'bottle', 'bus', 'car', 'cat',
           'chair', 'cow', 'diningtable', 'dog',
           'horse', 'motorbike', 'person', 'pottedplant',
           'sheep', 'sofa', 'train', 'tvmonitor']

dataset.parse_xml(images_dirs=[Pascal_VOC_dataset_images_dir],
                  image_set_filenames=[Pascal_VOC_dataset_image_set_filename],
                  annotations_dirs=[Pascal_VOC_dataset_annotations_dir],
                  classes=classes,
                  include_classes='all',
                  exclude_truncated=False,
                  exclude_difficult=False,
                  ret=False)



evaluator = Evaluator(model=model,
                      n_classes=n_classes,
                      data_generator=dataset,
                      model_mode=model_mode)



results = evaluator(img_height=img_height,
                    img_width=img_width,
                    batch_size=8,
                    data_generator_mode='resize',
                    round_confidences=False,
                    matching_iou_threshold=0.5,
                    border_pixels='include',
                    sorting_algorithm='quicksort',
                    average_precision_mode='sample',
                    num_recall_points=11,
                    ignore_neutral_boxes=True,
                    return_precisions=True,
                    return_recalls=True,
                    return_average_precisions=True,
                    verbose=True)

Answer 1:


I had this error and I fixed it by uninstalling all CUDA and cuDNN versions from my system. Then I installed CUDA Toolkit 9.0 (without any patches) and cuDNN v7.4.1 for CUDA 9.0.




Answer 2:


I've seen this error message for three different reasons, with different solutions:

1. You have cache issues

I regularly work around this error by shutting down my Python process, removing the ~/.nv directory (on Linux: rm -rf ~/.nv), and restarting the Python process. I don't know exactly why this works; it's probably at least partly related to the second option:

2. You're out of memory

The error can also show up if you run out of graphics card RAM. With an NVIDIA GPU you can check graphics card memory usage with nvidia-smi. This gives you not only a readout of how much GPU RAM is in use (something like 6025MiB / 6086MiB if you're almost at the limit) but also a list of what processes are using GPU RAM.

If you've run out of RAM, you'll need to restart the process (which should free up the RAM) and then take a less memory-intensive approach. A few options are:

  • reducing your batch size
  • using a simpler model
  • using less data
  • limiting the TensorFlow GPU memory fraction. For example, the following will make sure TensorFlow uses at most 90% of your GPU RAM:
import keras
import tensorflow as tf

# Cap TensorFlow at 90% of GPU memory (TF 1.x ConfigProto plus the standalone Keras backend).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))

This can slow down your model evaluation if it is not used together with the items above, presumably because the large dataset has to be swapped in and out of the small amount of memory you've allocated.

3. You have incompatible versions of CUDA, TensorFlow, NVIDIA drivers, etc.

If you've never had similar models working, you're not running out of VRAM, and your cache is clean, I'd go back and set up CUDA + TensorFlow using the best available installation guide - I have had the most success following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA/CUDA site.
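A quick way to confirm what your installed build can actually see (a small sketch; these tf.test calls exist in TF 1.x, while on TF 2.x you would check tf.config.experimental.list_physical_devices('GPU') instead):

    import tensorflow as tf

    print(tf.__version__)                # installed TensorFlow version
    print(tf.test.is_built_with_cuda())  # True if this build was compiled against CUDA
    print(tf.test.is_gpu_available())    # True only if a GPU device initializes cleanly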




Answer 3:


The problem is an incompatibility between newer TensorFlow versions (1.10.x and above) and cuDNN 7.0.5 / CUDA 9.0. The easiest fix is to downgrade TensorFlow to 1.8.0:

pip install --upgrade tensorflow-gpu==1.8.0




Answer 4:


I was struggling with this problem for a week. The reason was very silly: I used high-resolution photos for training.
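If oversized inputs are the culprit for you too, downscaling them before training is the usual remedy. A minimal sketch, assuming Pillow and hypothetical photos/ and photos_small/ directories:

    import os
    from PIL import Image

    src_dir, dst_dir = 'photos/', 'photos_small/'  # hypothetical paths
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name))
        img.thumbnail((512, 512))  # downscale in place, preserving aspect ratio
        img.save(os.path.join(dst_dir, name))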

Hopefully, this will save someone's time :)




Answer 5:


The problem can also occur if there is an incompatible version of cuDNN, which could be the case if you installed TensorFlow with conda, since conda also installs CUDA and cuDNN while installing TensorFlow.

The solution is to install TensorFlow with pip, and install CUDA and cuDNN separately without conda. For example, if you have CUDA 10.0.130 and cuDNN 7.4.1 (tested configuration), then:

pip install tensorflow-gpu==1.13.1



Answer 6:


1) Close all other notebooks that use the GPU.

2) TF 2.0 needs the cuDNN SDK (>= 7.4.1).

Extract cuDNN and add the path to its 'bin' folder under "Environment Variables / System variables / Path", e.g. "D:\Programs\x64\Nvidia\cudnn\bin".
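To check on Windows that the cuDNN DLL is actually reachable through PATH, a small sketch (assuming a cuDNN 7.x build, whose library is named cudnn64_7.dll):

    import ctypes

    # WinDLL raises OSError if the DLL cannot be found anywhere on PATH.
    ctypes.WinDLL('cudnn64_7.dll')
    print('cuDNN DLL found on PATH')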




Answer 7:


In my case, this error was encountered when I directly loaded a model from .json and .h5 files and attempted to predict outputs on certain inputs. Hence, before doing anything like this, I tried training an example model on MNIST, which allowed cuDNN to initialize.
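A minimal warm-up along those lines might look like this (a sketch using random MNIST-shaped data rather than the real dataset; one cheap training step is enough to force cuDNN to initialize):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense

    # A tiny random 'MNIST-like' batch; the Conv2D layer is what exercises cuDNN.
    x = np.random.rand(32, 28, 28, 1).astype('float32')
    y = np.random.randint(0, 10, size=(32,))

    model = Sequential([
        Conv2D(8, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        Flatten(),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    model.fit(x, y, epochs=1, verbose=0)  # one pass is enough to warm up cuDNN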




Answer 8:


I had this problem after upgrading to TF 2.0. The following started giving an error:

   outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")

I am using Ubuntu 16.04.6 LTS (Azure data science VM) and TensorFlow 2.0. I upgraded per the instructions on the TensorFlow GPU page (https://www.tensorflow.org/install/gpu), and this resolved the issue for me. By the way, it's a bunch of apt-get update/install commands, and I executed all of them.




Answer 9:


Keras is included in TensorFlow 2.0 and above. So:

  • remove import keras and
  • replace from keras.module.module import class statements with from tensorflow.keras.module.module import class
  • Maybe your GPU memory is full, so set allow_growth = True in the GPU options. This API is deprecated now, but the code snippet below, placed after your imports, may solve your problem.

    import tensorflow as tf
    from tensorflow.compat.v1.keras.backend import set_session

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    config.log_device_placement = True      # log device placement (on which device each op runs)
    sess = tf.compat.v1.Session(config=config)
    set_session(sess)
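On TF 2.x the same effect is also available without creating a session, via the tf.config API (a sketch; equivalent in spirit to allow_growth above):

    import tensorflow as tf

    # Allocate GPU memory on demand instead of reserving it all upfront.
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)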



Source: https://stackoverflow.com/questions/53698035/failed-to-get-convolution-algorithm-this-is-probably-because-cudnn-failed-to-in
