How to fix “ResourceExhaustedError: OOM when allocating tensor”

Submitted by 假装没事ソ on 2020-07-21 11:35:05

Question


I want to make a model with multiple inputs, so I tried to build a model like this:

# imports implied by the snippet
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.optimizers import Adam

# define two sets of inputs
inputA = Input(shape=(32, 64, 1))
inputB = Input(shape=(32, 1024))

# CNN branch
x = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(inputA)
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(500, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(500, activation='relu')(x)
x = Model(inputs=inputA, outputs=x)

# DNN branch
y = layers.Flatten()(inputB)
y = Dense(64, activation="relu")(y)
y = Dense(250, activation="relu")(y)
y = Dense(500, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)

# combine the output of the two branches
combined = concatenate([x.output, y.output])

# combined outputs (each Dense chained to the previous one)
z = Dense(300, activation="relu")(combined)
z = Dense(100, activation="relu")(z)
z = Dense(1, activation="softmax")(z)

model = Model(inputs=[x.input, y.input], outputs=z)

model.summary()

opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

and the summary: (model summary screenshot omitted in the original post)

But when I try to train this model,

history = model.fit([trainimage, train_product_embd], train_label,
                    validation_data=([validimage, valid_product_embd], valid_label),
                    epochs=10, steps_per_epoch=100, validation_steps=10)

the problem happens:

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 history = model.fit([trainimage, train_product_embd], train_label,
                    validation_data=([validimage, valid_product_embd], valid_label),
                    epochs=10, steps_per_epoch=100, validation_steps=10)

4 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470       ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                              self._handle, args,
-> 1472                                              run_metadata_ptr)
   1473       if run_metadata:
   1474         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[metrics/acc/Mean_1/_185]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Thanks for reading and hopefully helping me :)


Answer 1:


OOM stands for "out of memory". Your GPU is running out of memory, so it cannot allocate memory for this tensor. There are a few things you can do (several of them are combined in the sketch after this list):

  • Decrease the number of neurons in your Dense, Conv2D layers
  • Use a smaller batch_size (or increase steps_per_epoch)
  • Use grayscale images (there will be one channel instead of three)
  • Reduce the number of layers
  • Use more MaxPooling2D layers, and increase their pool size
  • Use larger strides in your Conv2D layers
  • Reduce the size of your images (you can use PIL or cv2 for that)
  • Apply dropout
  • Use smaller float precision, namely np.float32 if you accidentally used np.float64
  • If you're using a pre-trained model, freeze the first layers
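
For illustration only, here is a rough sketch (not your exact model) of how the CNN branch from the question could be slimmed down by combining several of the points above: larger strides, pooling, and fewer neurons in the Dense layer:

from tensorflow.keras import layers
from tensorflow.keras.layers import Input

inputA = Input(shape=(32, 64, 1))

# strides=2 shrinks the feature map early, cutting activation memory roughly 4x
x = layers.Conv2D(16, (3, 3), strides=2, activation='relu')(inputA)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # pool again to shrink further
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)    # 500 -> 128 neurons
x = layers.Dropout(0.5)(x)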

There is more useful information in this error:

OOM when allocating tensor with shape[800000,32,30,62]

This is a weird shape. If you're working with images, you should normally have 1 or 3 channels. On top of that, it seems you are passing your entire dataset at once; you should instead pass it in batches.
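
As a minimal sketch (reusing the variable names from the question, which are assumptions about your data), you could batch the arrays with tf.data instead of passing them whole:

import tensorflow as tf

# Slice the two input arrays and the labels into aligned examples, then batch:
# each step now moves 32 samples to the GPU instead of all 800000 at once.
train_ds = tf.data.Dataset.from_tensor_slices(
    ((trainimage, train_product_embd), train_label)
).shuffle(1024).batch(32)

history = model.fit(train_ds, epochs=10)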




Answer 2:


From the shape [800000,32,30,62], it seems your model is putting all of the data into a single batch.

Try specifying a batch size, like this:

history = model.fit([trainimage, train_product_embd], train_label,
                    validation_data=([validimage, valid_product_embd], valid_label),
                    epochs=10, steps_per_epoch=100, validation_steps=10,
                    batch_size=32)

If it still OOMs, then try reducing the batch_size.
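
If you want to automate that, here is a hypothetical retry loop (not from the original answer) that tries progressively smaller batch sizes; note that GPU memory can stay fragmented after an OOM, so restarting the process between attempts is often more reliable:

import tensorflow as tf

# Try progressively smaller batch sizes until one fits in GPU memory.
for bs in (128, 64, 32, 16, 8):
    try:
        history = model.fit([trainimage, train_product_embd], train_label,
                            validation_data=([validimage, valid_product_embd], valid_label),
                            epochs=10, batch_size=bs)
        break
    except tf.errors.ResourceExhaustedError:
        print('OOM at batch_size=%d, trying a smaller one' % bs)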




Answer 3:


Happened to me as well.

You can try reducing the number of trainable parameters by using some form of transfer learning: freeze the first few layers and use lower batch sizes.
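
For example, a minimal sketch of that idea, assuming a pre-trained Keras application such as MobileNetV2 (the choice of base model here is an assumption, not part of the original answer):

from tensorflow.keras.applications import MobileNetV2

# Pre-trained base without its classification head (hypothetical example).
base = MobileNetV2(include_top=False, weights='imagenet',
                   input_shape=(96, 96, 3))

# Freeze all but the last 20 layers: frozen layers need no gradients or
# optimizer state, which noticeably reduces GPU memory use.
for layer in base.layers[:-20]:
    layer.trainable = False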



Source: https://stackoverflow.com/questions/59394947/how-to-fix-resourceexhaustederror-oom-when-allocating-tensor
