Training broke with ResourceExhausted error

Submitted by 依然范特西╮ on 2019-11-27 23:38:36

I have been tweaking a lot these days to solve this problem.

In the end, I haven't solved the mystery of the memory usage described in the question. My guess is that while computing the gradients, TensorFlow allocates a lot of additional memory. I would need to check the TensorFlow source to confirm, which seems very cumbersome at this time. You can check how much memory your model is using from the terminal with the following command,

nvidia-smi

From its output you can estimate how much additional memory you have to work with.

But the solution to this type of problem lies in reducing the batch size.

In my case, reducing the batch size to 3 worked. This may vary from model to model.

But what if you are using a model whose embedding matrix is so big that you cannot load it into memory at all?

The solution is to write some painful code.

You have to do the lookup on the embedding matrix outside the graph and then load the result into the model. In short, for each batch, you have to give the looked-up embedding rows to the model (feed them via the feed_dict argument of sess.run()).
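As a concrete sketch of that lookup step (all names and sizes here are hypothetical, not from the original post): keep the full matrix in host RAM, gather only the rows the current batch needs, and feed that small slice through a placeholder.

```python
import numpy as np

# Hypothetical sizes; the real matrix would be too large for GPU memory.
vocab_size, embed_dim = 10_000, 50
embeddings = np.random.rand(vocab_size, embed_dim).astype(np.float32)

def lookup_batch(embeddings, word_ids):
    """Gather only the embedding rows the current batch needs."""
    return embeddings[word_ids]

batch_ids = np.array([[1, 5, 2], [7, 7, 0]])       # shape (batch, seq_len)
batch_embed = lookup_batch(embeddings, batch_ids)  # shape (2, 3, embed_dim)

# The small slice is then fed into the graph, e.g.:
# sess.run(train_op, feed_dict={embed_ph: batch_embed, ...})
```

Only the gathered slice ever crosses into GPU memory; the full matrix stays on the host.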

Next you will face a new problem,

You cannot make the embeddings trainable this way. The solution is to feed the embeddings through a placeholder and assign them to a Variable (say, A). During each batch of training, the learning algorithm updates the variable A. Then read the updated values of A back out of TensorFlow and write them into your embedding matrix, which lives outside of the model. (I told you the process is painful.)
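The host-side bookkeeping around that loop can be sketched as follows (the TensorFlow step is simulated with a fake gradient update; the real update would come from reading the Variable A back after sess.run):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 50
embeddings = np.random.rand(vocab_size, embed_dim).astype(np.float32)

# 1) Gather the rows for this batch and feed them into the graph, where a
#    placeholder assigns them to a trainable tf.Variable (the "A" above).
word_ids = np.array([3, 42, 7])
batch_rows = embeddings[word_ids]  # fancy indexing returns a copy

# 2) After sess.run(train_op, ...), read the updated Variable back out.
#    Here we simulate that update with a fake gradient step.
updated_rows = batch_rows - 0.01 * np.ones_like(batch_rows)

# 3) Scatter the updated rows back into the big host-side matrix, so the
#    next batch that touches these words sees the trained values.
embeddings[word_ids] = updated_rows
```

The cost is one gather and one scatter per batch, plus the device round trip, which is exactly why this approach is painful.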

Now your next question should be: what if even the embedding lookup for a single batch is too big to feed to the model? This is a fundamental problem that you cannot avoid. It is also why the NVIDIA GTX 1080, 1080 Ti and NVIDIA Titan Xp differ so much in price even though the 1080 Ti and 1080 run at higher clock frequencies: the premium largely buys more memory.

*Note*: The error occurred after 32 epochs. My question is: why after 32 epochs, and not at the initial epoch?

This is a major clue that the graph is not static during execution. By that I mean, you're likely doing sess.run(tf.something) instead of

my_something = tf.something
with tf.Session() as sess: 
    sess.run(my_something)

I ran into the same problem trying to implement a stateful RNN. I would occasionally reset the state, so I was doing sess.run([reset if some_condition else tf.no_op()]), which creates a fresh no-op node on every call. Simply adding nothing = tf.no_op() to my graph once and using sess.run([reset if some_condition else nothing]) solved my problem.
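A framework-free sketch of why this leaks (this is a toy stand-in for the TensorFlow graph, not real TF API): every call to an op factory appends a node to the graph, so calling one inside the training loop grows the graph a little each step until memory runs out.

```python
# Toy model of a computation graph: op factories append nodes to the graph.
class Graph:
    def __init__(self):
        self.ops = []

    def no_op(self):
        op = object()          # stand-in for a graph node
        self.ops.append(op)    # every factory call grows the graph
        return op

graph = Graph()

reusable = graph.no_op()       # built ONCE, before the loop
for step in range(1000):
    _ = reusable               # good: graph stays the same size
size_good = len(graph.ops)     # still 1

for step in range(1000):
    _ = graph.no_op()          # bad: one new node per step
size_bad = len(graph.ops)      # 1001 — grows without bound
```

After enough epochs the accumulated nodes exhaust memory, which matches the symptom of failing at epoch 32 rather than at the start.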

If you could post the training loop, it would be easier to tell if that is what's going wrong.
