How to fix this strange error: “RuntimeError: CUDA error: out of memory”

Submitted by 妖精的绣舞 on 2020-06-27 08:15:26

Question


I ran some code for a deep learning network. Training works well, but this error occurs when the code reaches the validation step.

I have five epochs, and every epoch has a training phase and a validation phase. I hit the error during validation in the first epoch. When I skip the validation code, the run continues to the second epoch with no error.

My code:

for epoch in range(10, 15):  # epochs 10 to 14
    if options["training"]["train"]:
        trainer.epoch(model, epoch)

    if options["validation"]["validate"]:
    # if epoch == 14:
        validator.epoch(model)


I suspect the validation code has a bug, but I cannot find it.


Answer 1:


You are seeing this error because your GPU ran out of memory. One way to solve it is to reduce the batch size until your code runs without the error.
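
As a rough illustration of "reduce the batch size until it runs", the search can be automated by halving on each out-of-memory error. This is only a sketch: run_step is a hypothetical callable standing in for one training or validation step, and fake_step is a stand-in so the example runs without a GPU.

```python
def largest_fitting_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error.

    run_step is a hypothetical callable that runs one training or
    validation step with the given batch size.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # CUDA out-of-memory surfaces as a RuntimeError whose message
            # contains "out of memory"; anything else is re-raised.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")


# Stand-in step (assumption): pretend more than 24 samples exhausts memory.
def fake_step(batch_size):
    if batch_size > 24:
        raise RuntimeError("CUDA error: out of memory")


fitting = largest_fitting_batch_size(fake_step, start_batch_size=128)
print(fitting)
```

In real PyTorch code you would probably also call torch.cuda.empty_cache() between attempts so that cached blocks from the failed attempt are returned to the driver.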




Answer 2:


1. When you only run validation, not training,
you don't need to compute gradients, because there is no backward pass.
In that situation, you can place your code under

with torch.no_grad():
    ...
    net=Net()
    pred_for_validation=net(input)
    ...

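Under torch.no_grad(), autograd does not keep intermediate activations for a backward pass, so the code above uses much less GPU memory.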
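
A slightly fuller sketch of a validation pass written this way is below. The validate function, the tiny nn.Linear model and the random CPU batches are all my own stand-ins (assumptions), not the asker's validator; only torch.no_grad(), model.eval() and .item() are the actual PyTorch calls being illustrated.

```python
import torch
from torch import nn


def validate(model, batches, loss_fn):
    """Return the mean loss over `batches` without building autograd graphs."""
    model.eval()  # switch dropout / batch norm layers to inference behaviour
    total_loss = 0.0
    num_batches = 0
    with torch.no_grad():  # no activations are stored for a backward pass
        for inputs, targets in batches:
            outputs = model(inputs)
            # .item() turns the 0-dim loss tensor into a plain Python float
            total_loss += loss_fn(outputs, targets).item()
            num_batches += 1
    return total_loss / max(num_batches, 1)


# Toy stand-ins (assumptions): a tiny linear model and random CPU data.
model = nn.Linear(4, 2)
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]
mean_loss = validate(model, batches, nn.MSELoss())
print(mean_loss)
```

Remember to call model.train() again before the next training epoch, since model.eval() stays in effect until you change it.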

2. If you use the += operator on a loss tensor in your code,
it keeps the autograd graph of every iteration alive, so memory use keeps growing.
In that case, convert the loss with float(), as described at
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory

Although the docs recommend float(), item() also worked in my case, for example:

entire_loss=0.0
for i in range(100):
    one_loss=loss_function(prediction,label)
    entire_loss+=one_loss.item()

3. If you use a for loop in your training code,
variables assigned inside it stay alive until they are overwritten or the loop ends.
So, in that case, you can explicitly delete them after calling optimizer.step()

for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1,intermediate_variable2,...



Answer 3:


There are a number of possible reasons, which I list below:

  1. Module parameters: check the dimensions of your modules. A linear layer that transforms a large input tensor (e.g., size 1000) into another large output tensor (e.g., size 1000) requires a weight matrix of size (1000, 1000).
  2. RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a large number of steps. Usually, you fix a number of decoding steps that is reasonable for your dataset.
  3. Tensor usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
  4. Batch size: incrementally increase your batch size until you run out of memory. It's a common trick that even well-known libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP).
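
To put a number on the first point, here is a back-of-the-envelope estimate of how much memory a linear layer's parameters take. The helper linear_layer_bytes is hypothetical, and the default of 4 bytes per parameter assumes float32 weights.

```python
def linear_layer_bytes(in_features, out_features, bytes_per_param=4, bias=True):
    """Estimate the parameter memory of one linear layer in bytes.

    Assumes float32 parameters (4 bytes each) unless told otherwise.
    """
    num_params = in_features * out_features  # weight matrix
    if bias:
        num_params += out_features  # one bias value per output feature
    return num_params * bytes_per_param


weights_only = linear_layer_bytes(1000, 1000, bias=False)
with_bias = linear_layer_bytes(1000, 1000)
print(weights_only, with_bias)  # 4000000 4004000
```

So the (1000, 1000) layer costs about 4 MB for its weights alone. During training each parameter also needs a gradient of the same size, and optimizers that keep per-parameter state add more on top.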

In addition, I recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html




Answer 4:


The best way is to find the process that is holding GPU memory and kill it:

Find the PID of the Python process with:

nvidia-smi

Copy the PID and kill it with:

sudo kill -9 pid
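
If you would rather look the PID up from a script than read the nvidia-smi table by eye, something like the sketch below may help. It assumes nvidia-smi is on your PATH and that the CSV query prints one "pid, process_name, used_memory" row per process; parse_compute_apps and list_gpu_processes are hypothetical helper names, and the sample text is made up for illustration.

```python
import subprocess

# nvidia-smi query that prints one CSV row per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    processes = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        pid, used = fields[0], fields[-1]
        name = ",".join(fields[1:-1])  # tolerate commas inside the name
        processes.append({
            "pid": int(pid),
            "name": name,
            # memory may be reported as a non-numeric placeholder
            "used_mib": int(used) if used.isdigit() else None,
        })
    return processes


def list_gpu_processes():
    """Run nvidia-smi and return the parsed process list (needs a GPU host)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


# Made-up sample output so the parser can be tried without a GPU.
sample = "4312, python, 10240\n5120, python3, 512\n"
parsed = parse_compute_apps(sample)
print(parsed)
```

Check that the PID really belongs to a stale run of your own before killing it, since kill -9 gives the process no chance to clean up.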


Source: https://stackoverflow.com/questions/54374935/how-to-fix-this-strange-error-runtimeerror-cuda-error-out-of-memory
