OOM when allocating tensor

Submitted anonymously (unverified) on 2019-12-03 02:03:01

Question:

How do I solve the problem of ResourceExhaustedError: OOM when allocating tensor?

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,32,28,28]

I have included nearly all the code:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# load the MNIST data (these lines were omitted from the original post)
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

learning_rate = 0.0001
epochs = 10
batch_size = 50

# declare the training data placeholders
# input x - for 28 x 28 pixels = 784 - this is the flattened image data that is
# drawn from mnist.train.next_batch()
x = tf.placeholder(tf.float32, [None, 784])
# dynamically reshape the input
x_shaped = tf.reshape(x, [-1, 28, 28, 1])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])

def create_new_conv_layer(input_data, num_input_channels, num_filters,
                          filter_shape, pool_shape, name):
    # setup the filter input shape for tf.nn.conv2d
    conv_filt_shape = [filter_shape[0], filter_shape[1], num_input_channels,
                       num_filters]

    # initialise weights and bias for the filter
    weights = tf.Variable(tf.truncated_normal(conv_filt_shape, stddev=0.03),
                          name=name + '_W')
    bias = tf.Variable(tf.truncated_normal([num_filters]), name=name + '_b')

    # setup the convolutional layer operation
    out_layer = tf.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')

    # add the bias
    out_layer += bias

    # apply a ReLU non-linear activation
    out_layer = tf.nn.relu(out_layer)

    # now perform max pooling over pool_shape[0] x pool_shape[1] windows
    ksize = [1, pool_shape[0], pool_shape[1], 1]
    strides = [1, 2, 2, 1]
    out_layer = tf.nn.max_pool(out_layer, ksize=ksize, strides=strides,
                               padding='SAME')

    return out_layer

# create some convolutional layers
layer1 = create_new_conv_layer(x_shaped, 1, 32, [5, 5], [2, 2], name='layer1')
layer2 = create_new_conv_layer(layer1, 32, 64, [5, 5], [2, 2], name='layer2')

flattened = tf.reshape(layer2, [-1, 7 * 7 * 64])

# setup some weights and bias values for this layer, then activate with ReLU
wd1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1000], stddev=0.03), name='wd1')
bd1 = tf.Variable(tf.truncated_normal([1000], stddev=0.01), name='bd1')
dense_layer1 = tf.matmul(flattened, wd1) + bd1
dense_layer1 = tf.nn.relu(dense_layer1)

# another layer with softmax activations
wd2 = tf.Variable(tf.truncated_normal([1000, 10], stddev=0.03), name='wd2')
bd2 = tf.Variable(tf.truncated_normal([10], stddev=0.01), name='bd2')
dense_layer2 = tf.matmul(dense_layer1, wd2) + bd2
y_ = tf.nn.softmax(dense_layer2)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=dense_layer2, labels=y))

# add an optimiser
optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cross_entropy)

# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# setup the initialisation operator
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    total_batch = int(len(mnist.train.labels) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
            _, c = sess.run([optimiser, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        test_acc = sess.run(accuracy,
                            feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost),
              "test accuracy: {:.3f}".format(test_acc))

    print("\nTraining complete!")
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

The lines referenced in the error are the create_new_conv_layer function and the sess.run call in the training loop.

More errors that I copied from the debugger's output are listed below (there were more lines, but I think these are the main ones and the others are caused by them):

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, layer1_W/read)]]

The second time I ran it, it issued the following error. I have both a CPU and a GPU, as can be seen in the output below. I understand that some of the CPU-related warnings might be because my TensorFlow wasn't compiled to use those features. I installed CUDA 8, cuDNN 6, Python 3.5, and TensorFlow 1.3.0 on Windows 10.

2017-10-03 03:53:58.944371: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:58.945563: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:59.230761: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties: name: Quadro K620 major: 5 minor: 0 memoryClockRate (GHz) 1.124 pciBusID 0000:01:00.0 Total memory: 2.00GiB Free memory: 1.66GiB
2017-10-03 03:53:59.231109: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-10-03 03:53:59.231229: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-10-03 03:53:59.231363: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K620, pci bus id: 0000:01:00.0)
2017-10-03 03:54:01.511141: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2017-10-03 03:54:01.511372: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:375] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2017-10-03 03:54:01.511862: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-03 03:54:01.512074: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\kernels\conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)

Answer 1:

The process failed with out-of-memory (OOM) because you pushed the whole test set through the network at once (see this question). It's easy to see why: 10000 * 32 * 28 * 28 float32 values take almost 1 GB, while your GPU has only 2 GB in total (1.66 GiB of it free), and most of that is already taken by the network itself.
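As a back-of-the-envelope check (float32 tensors use 4 bytes per element):

# size of the layer-1 activation tensor when all 10000 test images are fed at once
elements = 10000 * 32 * 28 * 28       # 250,880,000 float32 elements
bytes_needed = elements * 4           # 1,003,520,000 bytes
print("%.2f GiB" % (bytes_needed / 1024 ** 3))  # ~0.93 GiB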

The solution is to feed the neural network in batches not only for training but for testing as well. The resulting accuracy is then the average across all batches. Moreover, you don't need to do this after every epoch: are you really interested in the test results of every intermediate network?
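A minimal sketch of batched evaluation, reusing the mnist object, placeholders, and accuracy op from the code above (the test batch size of 1000 is an arbitrary choice; anything that fits in GPU memory works):

# evaluate the test set in chunks instead of all 10000 images at once
test_batch_size = 1000
num_test_batches = int(len(mnist.test.labels) / test_batch_size)
test_acc = 0
for i in range(num_test_batches):
    start = i * test_batch_size
    end = start + test_batch_size
    test_acc += sess.run(accuracy,
                         feed_dict={x: mnist.test.images[start:end],
                                    y: mnist.test.labels[start:end]})
test_acc /= num_test_batches  # average over equal-sized batches
print("test accuracy: {:.3f}".format(test_acc))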

Your second error message is very likely a consequence of the first failure: the cuDNN handle can no longer be created. I'd suggest restarting your machine.


