Training broke with ResourceExhausted error

Submitted by 依然范特西╮ on 2019-11-27 23:38:36

I have been tweaking a lot these days to solve this problem.

In the end, I haven't solved the mystery of the memory usage described in the question. My guess is that while computing the gradients, TensorFlow allocates a lot of additional memory. I would need to check the TensorFlow source to confirm, which seems very cumbersome at this time. You can check how much memory your model is using from the terminal with the following command,

nvidia-smi

From its output you can estimate how much additional memory you have to work with.

But the solution to this type of problem lies in reducing the batch size.

In my case, reducing the batch size to 3 worked. This may vary from model to model.

But what if you are using a model whose embedding matrix is so big that you cannot load it into memory at all?

The solution is to write some painful code.

You have to do the lookup on the embedding matrix outside the graph and then load the result into the model. In short, for each batch, you have to give the looked-up embedding rows to the model (feed them via the feed_dict argument of sess.run()).
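As a concrete sketch of that lookup step (all names and sizes here are hypothetical, not from the original post): keep the full matrix in host RAM, gather only the rows the current batch needs, and feed that small slice through a placeholder.

```python
import numpy as np

# Hypothetical sizes; the real matrix would be too large for GPU memory.
vocab_size, embed_dim = 10_000, 50
embeddings = np.random.rand(vocab_size, embed_dim).astype(np.float32)

def lookup_batch(embeddings, word_ids):
    """Gather only the embedding rows the current batch needs."""
    return embeddings[word_ids]

batch_ids = np.array([[1, 5, 2], [7, 7, 0]])       # shape (batch, seq_len)
batch_embed = lookup_batch(embeddings, batch_ids)  # shape (2, 3, embed_dim)

# The small slice is then fed into the graph, e.g.:
# sess.run(train_op, feed_dict={embed_ph: batch_embed, ...})
```

Only the gathered slice ever crosses into GPU memory; the full matrix stays on the host.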

Next you will face a new problem,

You cannot make the embeddings trainable this way. The solution is to feed the embeddings through a placeholder and assign them to a Variable (say, A). During each batch of training, the learning algorithm updates the variable A. Then read the updated values of A back out of TensorFlow and write them into your embedding matrix, which lives outside of the model. (I told you the process is painful.)
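The host-side bookkeeping around that loop can be sketched as follows (the TensorFlow step is simulated with a fake gradient update; the real update would come from reading the Variable A back after sess.run):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 50
embeddings = np.random.rand(vocab_size, embed_dim).astype(np.float32)

# 1) Gather the rows for this batch and feed them into the graph, where a
#    placeholder assigns them to a trainable tf.Variable (the "A" above).
word_ids = np.array([3, 42, 7])
batch_rows = embeddings[word_ids]  # fancy indexing returns a copy

# 2) After sess.run(train_op, ...), read the updated Variable back out.
#    Here we simulate that update with a fake gradient step.
updated_rows = batch_rows - 0.01 * np.ones_like(batch_rows)

# 3) Scatter the updated rows back into the big host-side matrix, so the
#    next batch that touches these words sees the trained values.
embeddings[word_ids] = updated_rows
```

The cost is one gather and one scatter per batch, plus the device round trip, which is exactly why this approach is painful.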

Now your next question should be: what if even the embedding lookup for a single batch is too big to feed to the model? This is a fundamental problem that you cannot avoid. It is also why the NVIDIA GTX 1080, 1080 Ti and NVIDIA Titan Xp differ so much in price even though the 1080 Ti and 1080 run at higher clock frequencies: the premium largely buys more memory.

*Note*: The error occurred after 32 epochs. My question is: why after 32 epochs, and not at the initial epoch?

This is a major clue that the graph is not static during execution. By that I mean, you're likely doing sess.run(tf.something) instead of

my_something = tf.something
with tf.Session() as sess: 
    sess.run(my_something)

I ran into the same problem trying to implement a stateful RNN. I would occasionally reset the state, so I was doing sess.run([reset if some_condition else tf.no_op()]), which creates a fresh no-op node on every call. Simply adding nothing = tf.no_op() to my graph once and using sess.run([reset if some_condition else nothing]) solved my problem.
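A framework-free sketch of why this leaks (this is a toy stand-in for the TensorFlow graph, not real TF API): every call to an op factory appends a node to the graph, so calling one inside the training loop grows the graph a little each step until memory runs out.

```python
# Toy model of a computation graph: op factories append nodes to the graph.
class Graph:
    def __init__(self):
        self.ops = []

    def no_op(self):
        op = object()          # stand-in for a graph node
        self.ops.append(op)    # every factory call grows the graph
        return op

graph = Graph()

reusable = graph.no_op()       # built ONCE, before the loop
for step in range(1000):
    _ = reusable               # good: graph stays the same size
size_good = len(graph.ops)     # still 1

for step in range(1000):
    _ = graph.no_op()          # bad: one new node per step
size_bad = len(graph.ops)      # 1001 — grows without bound
```

After enough epochs the accumulated nodes exhaust memory, which matches the symptom of failing at epoch 32 rather than at the start.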

If you could post the training loop, it would be easier to tell if that is what's going wrong.
