Resource exhausted: OOM when allocating tensor only on gpu

不打扰是莪最后的温柔 提交于 2021-01-05 13:04:04

问题


I'm trying to run several different ML architectures,
all vanilla, without any modification (git clone -> python train.py).
while the result is always the same- segmentation fault, or Resource exhausted: OOM when allocating tensor.
When running only on my CPU, the program finishes successfully
I'm running the session with

    config.gpu_options.per_process_gpu_memory_fraction=0.33
    config.gpu_options.allow_growth = True
    config.allow_soft_placement = True
    config.log_device_placement = True

And yet, the result is

2019-03-11 20:23:26.845851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************************************x**********____**********____**_____*
2019-03-11 20:23:26.845885: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):

2019-03-11 20:23:16.841149: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:16.841191: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:26.841486: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 640.00MiB.  Current allocation summary follows.
2019-03-11 20:23:26.841566: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 195, Chunks in use: 195. 48.8KiB allocated for chunks. 48.8KiB in use in bin. 23.3KiB client-requested in use in bin.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node transform_net1/tconv2/bn/moments/SquaredDifference (defined at /home/dvir/CLionProjects/gml/Dvir/FlexKernels/utils/tf_util.py:504)  = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transform_net1/tconv2/BiasAdd, transform_net1/tconv2/bn/moments/mean)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [[{{node div/_113}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1730_div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

I'm running with

tensorflow-gpu 1.12
tensorflow 1.13

GPU is

GeForce RTX 2080TI

The model is Dynamic Graph CNN for Learning on Point Clouds, and was tested successfully on another machine with 1080 ti.


回答1:


For TensorFlow 2.2.0 this script works -

if tf.config.list_physical_devices('GPU'):
    physical_devices = tf.config.list_physical_devices('GPU')
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
    tf.config.experimental.set_virtual_device_configuration(physical_devices[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4000)])

https://stackoverflow.com/a/63123354/5884380




回答2:


As explained here the line config.gpu_options.per_process_gpu_memory_fraction=0.33 determines the fraction of the overall amount of memory from visible GPU should be allocated (33% for you case). Increasing this value or removing this line(100%) will give more of needed memory.



来源:https://stackoverflow.com/questions/55120206/resource-exhausted-oom-when-allocating-tensor-only-on-gpu

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!