When I started training a neural network, it hit CUDA_ERROR_OUT_OF_MEMORY, yet the training could still go on without an actual error.
I ran into this when training models back to back: the GPU memory was still being held by the previous training run, so the easiest fix I found was to manually flush the GPU memory before starting the next training.
Use nvidia-smi to check the GPU memory usage:
nvidia-smi
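If you only care about the memory figures, nvidia-smi's query mode prints a compact per-GPU summary (these query options are part of the standard nvidia-smi interface):
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv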
Then try resetting the GPU, which clears its memory:
nvidia-smi --gpu-reset
The above command may not work if other processes are actively using the GPU.
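It also needs root privileges and, on a multi-GPU machine, an explicit GPU index. Assuming the stuck GPU is index 0, the call would look like this:
sudo nvidia-smi --gpu-reset -i 0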
Alternatively, you can list all of the processes that are currently using the GPU:
sudo fuser -v /dev/nvidia*
The output should look something like this:
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       2216 F...m  Xorg
                     sid        6114 F...m  krunner
                     sid        6116 F...m  plasmashell
                     sid        7227 F...m  akonadi_archive
                     sid        7239 F...m  akonadi_mailfil
                     sid        7249 F...m  akonadi_sendlat
                     sid       18120 F...m  chrome
                     sid       18163 F...m  chrome
                     sid       24154 F...m  code
/dev/nvidiactl:      root       2216 F...m  Xorg
                     sid        6114 F...m  krunner
                     sid        6116 F...m  plasmashell
                     sid        7227 F...m  akonadi_archive
                     sid        7239 F...m  akonadi_mailfil
                     sid        7249 F...m  akonadi_sendlat
                     sid       18120 F...m  chrome
                     sid       18163 F...m  chrome
                     sid       24154 F...m  code
/dev/nvidia-modeset: root       2216 F....  Xorg
                     sid        6114 F....  krunner
                     sid        6116 F....  plasmashell
                     sid        7227 F....  akonadi_archive
                     sid        7239 F....  akonadi_mailfil
                     sid        7249 F....  akonadi_sendlat
                     sid       18120 F....  chrome
                     sid       18163 F....  chrome
                     sid       24154 F....  code
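If fuser is not available, nvidia-smi itself can list the compute processes that are holding GPU memory; the query fields below should work on recent drivers, though the exact field names can vary slightly between versions:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv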
From here I got the PID of the process that was holding the GPU memory, which in my case was 24154.
Use the following command to kill the process by its PID:
sudo kill -9 MY_PID
Replace MY_PID with the relevant PID.
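If several stale processes are holding memory, a short shell loop can clear them all in one go. This is only a sketch: it assumes every compute PID reported by nvidia-smi belongs to your abandoned training runs, so double-check the list before running it (you do not want to kill your display server or desktop session):
for pid in $(nvidia-smi --query-compute-apps=pid --format=csv,noheader); do sudo kill -9 "$pid"; done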