CUDA_ERROR_OUT_OF_MEMORY in TensorFlow

盖世英雄少女心 2020-12-04 19:43

When I started training a neural network, it hit CUDA_ERROR_OUT_OF_MEMORY, but the training could still go on without error. I would like to understand why, because I want to make use of the GPU memory.
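
If the message appears at startup but training still runs, a common cause is TensorFlow trying to pre-allocate the entire GPU. As a minimal sketch (assuming TensorFlow 2.x; these tf.config calls are not from the original question), you can ask TensorFlow to allocate GPU memory on demand instead:

    # Sketch, assuming TensorFlow 2.x: enable on-demand GPU memory allocation
    # instead of reserving the whole GPU up front.
    import tensorflow as tf

    for gpu in tf.config.list_physical_devices('GPU'):
        # Must run before any op initializes the GPU.
        tf.config.experimental.set_memory_growth(gpu, True)

With memory growth enabled, TensorFlow only reserves GPU memory as it is actually needed rather than grabbing it all up front.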

6 Answers
  •  失恋的感觉
    2020-12-04 20:17

    I faced this issue when trying to train models back to back. I figured that the GPU memory was still held by the previous training run, so the easiest fix was to manually flush the GPU memory before starting the next training.

    Use nvidia-smi to check the GPU memory usage:

    nvidia-smi

    Then try to reset the GPU to release the leftover memory:

    nvidia-smi --gpu-reset


    The reset command may not work if other processes are actively using the GPU.

    Alternatively, you can use the following command to list all the processes that are using the GPU:

    sudo fuser -v /dev/nvidia*
    

    And the output should look like this:

    USER        PID ACCESS COMMAND
    /dev/nvidia0:        root       2216 F...m Xorg
                         sid        6114 F...m krunner
                         sid        6116 F...m plasmashell
                         sid        7227 F...m akonadi_archive
                         sid        7239 F...m akonadi_mailfil
                         sid        7249 F...m akonadi_sendlat
                         sid       18120 F...m chrome
                         sid       18163 F...m chrome
                         sid       24154 F...m code
    /dev/nvidiactl:      root       2216 F...m Xorg
                         sid        6114 F...m krunner
                         sid        6116 F...m plasmashell
                         sid        7227 F...m akonadi_archive
                         sid        7239 F...m akonadi_mailfil
                         sid        7249 F...m akonadi_sendlat
                         sid       18120 F...m chrome
                         sid       18163 F...m chrome
                         sid       24154 F...m code
    /dev/nvidia-modeset: root       2216 F.... Xorg
                         sid        6114 F.... krunner
                         sid        6116 F.... plasmashell
                         sid        7227 F.... akonadi_archive
                         sid        7239 F.... akonadi_mailfil
                         sid        7249 F.... akonadi_sendlat
                         sid       18120 F.... chrome
                         sid       18163 F.... chrome
                         sid       24154 F.... code
    

    From here, I got the PID of the process that was holding the GPU memory, which in my case was 24154.

    Use the following command to kill the process by its PID:

    sudo kill -9 MY_PID
    

    Replace MY_PID with the relevant PID.
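
    If you need to do this often, the lookup-and-kill steps can be scripted. The snippet below is a sketch rather than part of the answer above: it assumes a Linux machine with nvidia-smi on the PATH, uses nvidia-smi's --query-compute-apps option to list the PIDs of GPU compute processes, and sends each one SIGKILL. As with kill -9, killing processes owned by other users needs root privileges, so review the list before running it.

    # Hypothetical helper, not from the original answer: list GPU compute
    # processes via nvidia-smi and kill them, mirroring the manual
    # nvidia-smi / fuser / kill -9 steps above.
    import os
    import signal
    import subprocess

    query = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )

    for line in query.stdout.strip().splitlines():
        pid, name = [field.strip() for field in line.split(",", 1)]
        print(f"Killing PID {pid} ({name})")
        os.kill(int(pid), signal.SIGKILL)  # equivalent to `kill -9 <pid>`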
