tensorflow-GPU OOM issue after several epochs

Submitted by 微笑、不失礼 on 2020-08-27 18:42:19

Question


I used TensorFlow to train a CNN on an Nvidia GeForce 1060 (6 GB memory), but I got an OOM exception.

The training process went fine for the first two epochs, but raised the OOM exception on the third epoch.

============================
2017-10-27 11:47:30.219130: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************************************************************************************xxxxxx
2017-10-27 11:47:30.265389: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[10,10,48,48,48]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,10,48,48,48]
  [[Node: gradients_4/global/detector_scope/maxpool_conv3d_2/MaxPool3D_grad/MaxPool3DGrad = MaxPool3DGrad[T=DT_FLOAT, TInput=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="VALID", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](global/detector_scope/maxpool_conv3d_2/transpose, global/detector_scope/maxpool_conv3d_2/MaxPool3D, gradients_4/global/detector_scope/maxpool_conv3d_2/transpose_1_grad/transpose)]]
  [[Node: Momentum_4/update/_540 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1540_Momentum_4/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
============================

So I am confused about why I got this OOM exception on the third epoch, after the first two epochs finished without problems.

Given that the dataset is the same in every epoch, if I had run out of GPU memory I would expect the exception on the first epoch. But the first two epochs did finish successfully. So why did this happen later?

Any suggestions, please?


Answer 1:


There are two points at which you are likely to see OOM errors: when you first start training, and after at least one epoch has completed.

The first situation is simply due to the model's memory footprint. The easiest fix is to reduce the batch size. If your model is very large and your batch size is already down to one, you still have a few options: reduce the size of the hidden layers, or move to a cloud instance with enough GPU memory (or even CPU-only execution) so that the static allocation of memory succeeds.
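As a rough sanity check on why batch size matters, you can estimate the footprint of the tensor named in the error. The shape [10,10,48,48,48] from the traceback, at 4 bytes per float32 element, already accounts for roughly 42 MiB for a single activation (its gradient doubles that), and the leading batch dimension scales this linearly. A minimal sketch (the `tensor_bytes` helper is just for illustration):

```python
# Estimate the raw size of a dense float32 tensor from its shape.
# Shape taken from the OOM message above: [10, 10, 48, 48, 48].
from functools import reduce
from operator import mul

def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element

shape = [10, 10, 48, 48, 48]
print(f"{tensor_bytes(shape) / 2**20:.1f} MiB per tensor")        # ~42.2 MiB

# Halving the batch (leading) dimension halves the activation memory:
half = tensor_bytes([5, 10, 48, 48, 48])
print(f"{half / 2**20:.1f} MiB at half batch size")               # ~21.1 MiB
```

Keep in mind this counts one tensor only; the total footprint includes every activation kept for backprop, the weights, and the optimizer slots, which is why even a modest batch-size reduction can move a model from OOM to fitting.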

For the second situation, you are likely running into a memory leak of sorts. Many training implementations use a callback on a held-out dataset to compute a validation score. This execution, for example when invoked by Keras, may hold on to GPU session resources. If those are not released, they build up and can cause the GPU to report OOM after several epochs. Others have suggested using a second GPU instance for the validation session, but I think a better approach is smarter session handling in the validation callback, specifically releasing GPU session resources when each validation callback completes.

Here is pseudo code illustrating the callback problem. This callback leads to OOM:

my_models_validation_score = tf.get_some_v_score

This callback does not lead to OOM:

with tf.Session() as sess: 
    sess.run(get_some_v_score)

I invite others to add to this answer...
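The difference between the two callbacks above can be mimicked in plain Python, with a toy resource pool standing in for GPU memory. The `ResourcePool` and `ToySession` classes below are hypothetical, purely for illustration: the leaky path allocates on every call and never frees, while the context-manager path releases on exit, which is exactly why the `with tf.Session()` form avoids the build-up.

```python
# Plain-Python analogy of the leak (no TensorFlow required).
# ResourcePool and ToySession are made-up names standing in for GPU memory
# and a TF session; only the allocate/release pattern mirrors the real issue.

class ResourcePool:
    """Tracks how many 'GPU' allocations are currently live."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.live = 0

    def allocate(self):
        if self.live >= self.capacity:
            raise MemoryError("OOM: pool exhausted")
        self.live += 1

    def release(self):
        self.live -= 1

class ToySession:
    """Allocates from the pool on run(); frees only when used as a context manager."""
    def __init__(self, pool):
        self.pool = pool

    def run(self):
        self.pool.allocate()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.pool.release()
        return False

# Leaky pattern: a fresh session per epoch, never released -> fails on epoch 3.
pool = ResourcePool(capacity=2)
try:
    for epoch in range(3):
        ToySession(pool).run()
except MemoryError as e:
    print(f"epoch {epoch + 1}: {e}")      # epoch 3: OOM: pool exhausted

# Clean pattern: the 'with' block releases resources after every epoch.
pool = ResourcePool(capacity=2)
for epoch in range(3):
    with ToySession(pool) as sess:
        sess.run()
print("clean pattern finished, live =", pool.live)   # live = 0
```

The capacities and epoch counts are arbitrary; the point is only that the first loop accumulates live allocations across epochs while the second returns to zero after each one.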



Source: https://stackoverflow.com/questions/46981853/tensorflow-gpu-oom-issue-after-several-epochs
