Tensorflow: GPU Utilization is almost always at 0%

Submitted by 百般思念 on 2019-12-03 03:35:34

After doing some experiments, I found the answer, so I'm posting it here since it could be useful to someone else.

First, get_next_batch is approximately 15x slower than train_op (thanks to Eric Platon for pointing this out).
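A crude way to check that ratio yourself is to time the two pieces separately. The sketch below is only schematic: get_next_batch and train_op stand for the question's own data-fetching function and training op, and the dummy model and sleep are placeholders I made up for illustration.

```python
import time
import tensorflow as tf  # TF 1.x style, matching the queue-based pipeline discussed here

# Stand-ins for the question's pipeline: a deliberately slow data fetch
# and a trivial training step (both hypothetical).
def get_next_batch():
    time.sleep(0.15)                      # simulate slow I/O / preprocessing
    return [[0.0] * 784] * 32

w = tf.Variable(tf.zeros([784, 1]))
x = tf.placeholder(tf.float32, [None, 784])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    t0 = time.time()
    batch = get_next_batch()                   # CPU-bound data fetching
    t1 = time.time()
    sess.run(train_op, feed_dict={x: batch})   # GPU-bound training step
    t2 = time.time()
    print("fetch %.3fs vs train %.3fs" % (t1 - t0, t2 - t1))
```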

However, I thought that the queue was first fed up to capacity, and that only then was training supposed to begin. Hence, even though get_next_batch is much slower, I expected the queue to hide this latency, at least at the start: it would hold capacity examples and would only need to fetch new data once it dropped to min_after_dequeue (which is lower than capacity), resulting in a reasonably steady GPU utilization.

But in reality, training begins as soon as the queue holds min_after_dequeue examples. The queue is dequeued to run train_op as soon as it reaches min_after_dequeue, and since feeding the queue is 15x slower than executing train_op, the number of elements in the queue drops below min_after_dequeue right after the first iteration, and train_op then has to wait for the queue to climb back up to min_after_dequeue examples.
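For reference, this is the kind of input pipeline the explanation above refers to. The example tensor and the batch size are placeholders; the capacity and min_after_dequeue values mirror the ones mentioned below, and the point is simply that dequeuing (and hence training) can start as soon as min_after_dequeue examples are buffered, not when the queue reaches capacity.

```python
import tensorflow as tf  # TF 1.x queue runners (pre tf.data)

batch = 32
# Hypothetical single-example tensor; in a real pipeline this would
# come from a reader/preprocessing op.
example = tf.random_uniform([784])

# Dequeuing can begin once min_after_dequeue examples are buffered.
images = tf.train.shuffle_batch(
    [example],
    batch_size=batch,
    capacity=100 * batch,            # maximum number of buffered examples
    min_after_dequeue=80 * batch)    # training can start once this many are queued

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(images).shape)    # (32, 784)
    coord.request_stop()
    coord.join(threads)
```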

When I force train_op to wait until the queue is fed up to capacity (with capacity = 100*batch) instead of letting it start automatically once it reaches min_after_dequeue (with min_after_dequeue = 80*batch), GPU utilization stays steady for about 10 seconds before dropping back to 0%. This is understandable, since the queue drains back down to min_after_dequeue examples in less than 10 seconds.
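One way to reproduce that "wait until the queue is full" behaviour (not necessarily how my original experiment did it) is to build the RandomShuffleQueue by hand so that its size() op is accessible and can be polled before the first training step. Everything below is an assumed sketch with placeholder tensors.

```python
import time
import tensorflow as tf  # TF 1.x

batch = 32
capacity = 100 * batch
min_after_dequeue = 80 * batch

# Hypothetical single-example tensor; in the real pipeline this would
# come from get_next_batch / a reader op.
example = tf.random_uniform([784])

queue = tf.RandomShuffleQueue(
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    dtypes=[tf.float32],
    shapes=[[784]])
enqueue_op = queue.enqueue(example)
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op] * 4))

images = queue.dequeue_many(batch)   # would feed the model / train_op
queue_size = queue.size()

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    # Block until the queue is filled to capacity before the first step,
    # instead of starting as soon as min_after_dequeue is reached.
    while sess.run(queue_size) < capacity:
        time.sleep(0.1)

    print(sess.run(images).shape)    # (32, 784)
    coord.request_stop()
    coord.join(threads)
```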
