Question
I am working on a large CUDA app in C++ that runs various models and needs to completely release all GPU memory, or the other operations will fail.
I am able to release all the memory after closing all TensorFlow sessions and calling cudaDeviceReset(). But afterwards I cannot run any new TensorFlow code, and session creation returns nullptr. I tried calling cudaDeviceSynchronize() before and after the reset, thinking that would help, but no luck.
I figured a call to InitMain would re-initialize TensorFlow, but it does not seem to when I call it again after the reset. Is there a specific entry point I can call to basically "start over" and have TensorFlow once again able to use the GPU and produce sessions?
I am using C++, not Python. Below is what TensorFlow logs after I call cudaDeviceReset() and then attempt to open a new TensorFlow session in C++.
2018-10-04 17:01:19.225505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-10-04 17:01:19.326074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-04 17:01:19.326091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-10-04 17:01:19.326095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-10-04 17:01:19.326215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9446 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-04 17:01:19.326554: E tensorflow/stream_executor/cuda/cuda_driver.cc:785] failed to memset memory: CUDA_ERROR_INVALID_VALUE
2018-10-04 17:01:19.326593: E tensorflow/core/common_runtime/direct_session.cc:154] Failed precondition: Failed to memcopy into scratch buffer for device 0
This may be related: the first run has extra lines at the beginning, as if a one-time initialization has run. The second run lacks those lines. Below is what the first run looks like.
2018-10-04 17:01:17.253809: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-04 17:01:17.254173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 9.76GiB
2018-10-04 17:01:17.254185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-10-04 17:01:17.413712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-04 17:01:17.413733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-10-04 17:01:17.413737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-10-04 17:01:17.413888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9446 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
I have at least figured out how to release the memory properly, and nvidia-smi shows my application dropping the memory as expected after the reset. But that is not useful if I can't do anything afterwards. I appreciate any help!
EDIT: I was asked which code I'm using to create the session. I am not using any custom session options:
tensorflow::NewSession(tensorflow::SessionOptions());
A basic reproduction is to load a graph into a session, close the session and graph, free the pointers, call cudaDeviceReset(), and then try to open another session, which returns nullptr and produces the error messages above.
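The order of those steps can be sketched as a small harness. This is only an illustration under stated assumptions: ReproSteps, ReproResult, RunRepro and MakeSimulatedSteps are hypothetical names, not part of TensorFlow or CUDA, and the real calls are named only in comments so that the block compiles without either library. The stub does not touch a GPU; it merely imitates the behavior reported above.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical harness: none of these names come from TensorFlow or CUDA.
struct ReproSteps {
  // Real application: tensorflow::NewSession(tensorflow::SessionOptions())
  // followed by session->Create(graph_def); returns true when the session
  // pointer is not nullptr and the graph loaded.
  std::function<bool()> open_session;
  // Real application: session->Close(), then free the session pointer.
  std::function<void()> close_session;
  // Real application: cudaDeviceReset().
  std::function<void()> reset_device;
};

struct ReproResult {
  bool first_open_ok = false;   // session opened before the reset
  bool second_open_ok = false;  // session opened after the reset
};

// Runs the sequence: open, close, reset the device, open again.
ReproResult RunRepro(const ReproSteps& steps) {
  assert(steps.open_session && steps.close_session && steps.reset_device);
  ReproResult result;
  result.first_open_ok = steps.open_session();
  if (result.first_open_ok) steps.close_session();
  steps.reset_device();
  result.second_open_ok = steps.open_session();
  if (result.second_open_ok) steps.close_session();
  return result;
}

// Stub that only imitates the reported behavior: opening a session
// succeeds until the device has been reset, and fails afterwards.
ReproSteps MakeSimulatedSteps() {
  auto device_was_reset = std::make_shared<bool>(false);
  ReproSteps steps;
  steps.open_session = [device_was_reset] { return !*device_was_reset; };
  steps.close_session = [] {};
  steps.reset_device = [device_was_reset] { *device_was_reset = true; };
  return steps;
}
```

Calling RunRepro(MakeSimulatedSteps()) exercises the sequence with the stub. To reproduce the actual failure, the three callbacks would be replaced with the TensorFlow and CUDA calls named in the comments.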
Source: https://stackoverflow.com/questions/52652479/how-to-reuse-tensorflow-after-cudadevicereset-in-c