I'm running tensorflow-gpu on Windows 10 using a simple MNIST neural network program. When it tries to run, it encounters a CUBLAS_STATUS_ALLOC_FAILED error.
There are at least two distinct problems here. The first occurs when a Python process is re-run and the GPU memory from the previous run has not been freed. You can tell this is happening because the new Python process consumes a huge amount of GPU memory the moment it appears, and then fails when it tries to acquire more. In the attached screen grab, ~6GB is acquired on startup. To check GPU memory usage in Windows, open Task Manager and look at the Dedicated GPU Memory column under the Details tab. In this case, reboot the PC, since the problem is caused by running out of GPU memory. TF is designed not to free memory during a session, as doing so would lead to fragmentation, so it looks like the IPython/Python session is holding on to the TF instance and not freeing the memory from the last run. In my case, using PyCharm with an IPython session, repeated runs eventually led to nearly all of my GPU memory being grabbed statically on startup, with little left for dynamic growth.
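If you would rather check this from a script than from Task Manager, here is a minimal sketch that asks `nvidia-smi` for used and total memory per GPU. This is not part of my original setup: it assumes an NVIDIA driver that puts `nvidia-smi` on the PATH, and the helper names `parse_memory_csv` and `query_gpu_memory` are made up for illustration.

```python
import shutil
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' pairs (in MiB), one GPU per line."""
    rows = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            used, total = int(fields[0]), int(fields[1])
        except (ValueError, IndexError):
            # Skip lines that are not two integers (e.g. an unsupported field).
            continue
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or None if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    usage = query_gpu_memory()
    if usage is None:
        print("nvidia-smi not available; use Task Manager instead")
    else:
        for index, (used, total) in enumerate(usage):
            print(f"GPU {index}: {used} MiB used of {total} MiB")
```

Run it before starting your program. If most of the memory is already in use while nothing of yours should be running, a stale process is probably still holding it.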
The second problem occurs when the GPU device is misconfigured. Depending on the TF version and how many devices you are using, you may need to set the same GPU memory policy across all devices. The policy is either to allow GPU memory to grow during a session or to grab as much as possible on startup. Various fixes are listed above; choose the one that fits the TF version you're using and whether you have more than one device.
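As an illustration of the "grow during a session" policy, here is a minimal sketch for TF 2.x that applies the same setting to every visible GPU. It assumes TF 2.1 or later, where `tf.config.list_physical_devices` is available, and the wrapper name `enable_memory_growth` is my own. On TF 1.x the rough equivalent is to set `config.gpu_options.allow_growth = True` on a `tf.ConfigProto` and pass it to `tf.Session(config=config)`.

```python
def enable_memory_growth():
    """Enable on-demand GPU memory allocation on every visible GPU.

    Returns the number of GPUs configured (0 if TensorFlow is not
    installed or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0

    configured = 0
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            # The policy has to be identical on all GPUs, so set it on each one.
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPUs were already initialized; the call must
            # happen before any op or model touches the device.
            print(f"Could not set memory growth on {gpu}: {err}")
    return configured


if __name__ == "__main__":
    print(f"Memory growth enabled on {enable_memory_growth()} GPU(s)")
```

Call it at the very top of the program, before building the model or creating any tensors, otherwise TF will already have reserved memory using the default policy.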