How to fix “ResourceExhaustedError: OOM when allocating tensor”

Submitted by 假装没事ソ on 2020-07-21 11:35:05

Question


I want to make a model with multiple inputs, so I tried to build a model like this:

# imports implied by the snippet
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.optimizers import Adam

# define two sets of inputs
inputA = Input(shape=(32, 64, 1))
inputB = Input(shape=(32, 1024))

# CNN branch
x = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(inputA)
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(500, activation='relu')(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(500, activation='relu')(x)
x = Model(inputs=inputA, outputs=x)

# DNN branch
y = layers.Flatten()(inputB)
y = Dense(64, activation="relu")(y)
y = Dense(250, activation="relu")(y)
y = Dense(500, activation="relu")(y)
y = Model(inputs=inputB, outputs=y)

# combine the output of the two branches
combined = concatenate([x.output, y.output])

# combined outputs (each Dense chained to the previous one)
z = Dense(300, activation="relu")(combined)
z = Dense(100, activation="relu")(z)
z = Dense(1, activation="softmax")(z)

model = Model(inputs=[x.input, y.input], outputs=z)

model.summary()

opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

and the summary: (model summary screenshot omitted in the original post)

But when I try to train this model,

history = model.fit([trainimage, train_product_embd], train_label,
                    validation_data=([validimage, valid_product_embd], valid_label),
                    epochs=10, steps_per_epoch=100, validation_steps=10)

the problem happens:

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 history = model.fit([trainimage, train_product_embd], train_label,
                    validation_data=([validimage, valid_product_embd], valid_label),
                    epochs=10, steps_per_epoch=100, validation_steps=10)

4 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470       ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                              self._handle, args,
-> 1472                                              run_metadata_ptr)
   1473       if run_metadata:
   1474         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[metrics/acc/Mean_1/_185]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[800000,32,30,62] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d_1/convolution}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Thanks for reading and hopefully helping me :)


Answer 1:


OOM stands for "out of memory". Your GPU is running out of memory, so it cannot allocate memory for this tensor. There are a few things you can do (several of them are combined in the sketch after this list):

  • Decrease the number of neurons in your Dense, Conv2D layers
  • Use a smaller batch_size (or increase steps_per_epoch)
  • Use grayscale images (there will be one channel instead of three)
  • Reduce the number of layers
  • Use more MaxPooling2D layers, and increase their pool size
  • Use larger strides in your Conv2D layers
  • Reduce the size of your images (you can use PIL or cv2 for that)
  • Apply dropout
  • Use smaller float precision, namely np.float32 if you accidentally used np.float64
  • If you're using a pre-trained model, freeze the first layers
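
For illustration only, here is a rough sketch (not your exact model) of how the CNN branch from the question could be slimmed down by combining several of the points above: larger strides, pooling, and fewer neurons in the Dense layer:

from tensorflow.keras import layers
from tensorflow.keras.layers import Input

inputA = Input(shape=(32, 64, 1))

# strides=2 shrinks the feature map early, cutting activation memory roughly 4x
x = layers.Conv2D(16, (3, 3), strides=2, activation='relu')(inputA)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # pool again to shrink further
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)    # 500 -> 128 neurons
x = layers.Dropout(0.5)(x)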

There is more useful information in this error:

OOM when allocating tensor with shape[800000,32,30,62]

This is a weird shape. If you're working with images, you should normally have 1 or 3 channels. On top of that, it seems you are passing your entire dataset at once; you should instead pass it in batches.
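
As a minimal sketch (reusing the variable names from the question, which are assumptions about your data), you could batch the arrays with tf.data instead of passing them whole:

import tensorflow as tf

# Slice the two input arrays and the labels into aligned examples, then batch:
# each step now moves 32 samples to the GPU instead of all 800000 at once.
train_ds = tf.data.Dataset.from_tensor_slices(
    ((trainimage, train_product_embd), train_label)
).shuffle(1024).batch(32)

history = model.fit(train_ds, epochs=10)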




Answer 2:


From the shape [800000,32,30,62], it seems your model is putting all of the data into a single batch.

Try specifying a batch size, like this:

history = model.fit([trainimage, train_product_embd], train_label,
                    validation_data=([validimage, valid_product_embd], valid_label),
                    epochs=10, steps_per_epoch=100, validation_steps=10,
                    batch_size=32)

If it still OOMs, then try reducing the batch_size.
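
If you want to automate that, here is a hypothetical retry loop (not from the original answer) that tries progressively smaller batch sizes; note that GPU memory can stay fragmented after an OOM, so restarting the process between attempts is often more reliable:

import tensorflow as tf

# Try progressively smaller batch sizes until one fits in GPU memory.
for bs in (128, 64, 32, 16, 8):
    try:
        history = model.fit([trainimage, train_product_embd], train_label,
                            validation_data=([validimage, valid_product_embd], valid_label),
                            epochs=10, batch_size=bs)
        break
    except tf.errors.ResourceExhaustedError:
        print('OOM at batch_size=%d, trying a smaller one' % bs)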




Answer 3:


Happened to me as well.

You can try reducing the number of trainable parameters by using some form of transfer learning: freeze the first few layers and use lower batch sizes.
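
For example, a minimal sketch of that idea, assuming a pre-trained Keras application such as MobileNetV2 (the choice of base model here is an assumption, not part of the original answer):

from tensorflow.keras.applications import MobileNetV2

# Pre-trained base without its classification head (hypothetical example).
base = MobileNetV2(include_top=False, weights='imagenet',
                   input_shape=(96, 96, 3))

# Freeze all but the last 20 layers: frozen layers need no gradients or
# optimizer state, which noticeably reduces GPU memory use.
for layer in base.layers[:-20]:
    layer.trainable = False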



Source: https://stackoverflow.com/questions/59394947/how-to-fix-resourceexhaustederror-oom-when-allocating-tensor
