Question
I ran some deep learning code. First I trained the network, which works well, but this error (RuntimeError: CUDA error: out of memory) occurs when the validation step runs.
I have five epochs; each epoch has a training phase and a validation phase. The error occurs during validation in the first epoch. If I skip the validation code, training runs into the second epoch with no error.
My code:
for epoch in range(10, 15):  # five epochs: 10 through 14
    if options["training"]["train"]:
        trainer.epoch(model, epoch)
    if options["validation"]["validate"]:
        # if epoch == 14:
        validator.epoch(model)


I suspect the validation code has a bug, but I cannot find it.
Answer 1:
The error you have provided occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size until your code runs without this error.
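For example (a minimal sketch; val_dataset and the batch sizes are placeholders, not from the question), the batch size is set when building the DataLoader, so you can halve it until the error goes away:

from torch.utils.data import DataLoader

# Hypothetical dataset; try halving the batch size (e.g., 64 -> 32 -> 16)
# until the validation pass fits in GPU memory.
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)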
Answer 2:
1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under a torch.no_grad() context:

with torch.no_grad():
    ...
    net = Net()
    pred_for_validation = net(input)
    ...

The code above doesn't build a computation graph or store intermediate activations for a backward pass, so it uses far less GPU memory.
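As a fuller sketch of the same idea (assuming a model, a loss_function, and a val_loader are already defined; none of these names come from the question), a typical validation pass also calls model.eval() so that dropout and batch norm behave in inference mode:

import torch

def validate(model, val_loader, loss_function, device="cuda"):
    model.eval()                      # inference mode for dropout / batch norm
    total_loss = 0.0
    with torch.no_grad():             # no graph is built, activations freed right away
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            total_loss += loss_function(outputs, labels).item()  # plain Python float
    model.train()                     # switch back for the next training epoch
    return total_loss / len(val_loader)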
2. If you use the += operator on a loss tensor, it can keep accumulating the computation graph across iterations. In that case, you need to detach the value, e.g. with float(), as described here:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
Even though the docs suggest float(), in my case item() also worked:

entire_loss = 0.0
for i in range(100):
    one_loss = loss_function(prediction, label)
    entire_loss += one_loss.item()  # item() returns a Python number, detached from the graph
3. If you use a for loop in your training code, intermediate tensors can stay alive until the entire loop ends. So, in that case, you can explicitly delete variables after calling optimizer.step():

for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1, intermediate_variable2, ...
Answer 3:
It might happen for a number of reasons, which I try to list below:
- Module parameters: check the dimensions of your modules. A linear layer that maps a big input tensor (e.g., size 1000) to another big output tensor (e.g., size 1000) requires a weight matrix of size (1000, 1000).
- RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a large number of steps. You usually fix a number of decoding steps that is reasonable for your dataset.
- Tensor usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
- Batch size: incrementally increase your batch size until you run out of memory. It's a common trick that even well-known libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP); a sketch follows this list.
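A rough probe of that trick might look like this (a sketch under assumptions: make_batch(n) is a hypothetical helper returning one input batch of size n, and the OOM is detected via the RuntimeError message PyTorch raises):

import torch

def find_max_batch_size(model, make_batch, start=1, limit=1024):
    batch_size = start
    while batch_size <= limit:
        try:
            with torch.no_grad():
                model(make_batch(batch_size))   # trial forward pass
            batch_size *= 2                     # it fits, so try a bigger batch
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()        # release cached allocations
                return batch_size // 2          # last size that fit
            raise
    return limit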
In addition, I would recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
Answer 4:
The best way is to find the process occupying GPU memory and kill it:
Find the PID of the Python process with:
nvidia-smi
Copy the PID and kill it with:
sudo kill -9 pid
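If the default nvidia-smi table is hard to read, its query mode can list just the compute processes and their memory usage (these flags are part of stock nvidia-smi):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv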
Source: https://stackoverflow.com/questions/54374935/how-to-fix-this-strange-error-runtimeerror-cuda-error-out-of-memory