Nan in summary histogram

Submitted anonymously (unverified) on 2019-12-03 01:00:01

Question:

My program runs into this problem only occasionally (not on every run). When it happens, I can always reproduce the error by loading the last model saved before the program crashed due to NaN. When rerunning from that checkpoint, the first training step seems fine: the model produces a normal loss (I printed it and saw no problem), but after the gradients are applied, the values of the embedding variables turn to NaN.

So what is the root cause of the NaN problem? I'm confused about how to debug this further, since the same program with the same data and parameters usually runs fine and only hits this problem on some runs.

Loading existing model from: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000
Train from restored model: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5235 get requests, put_count=4729 evicted_count=1000 eviction_rate=0.211461 and unsatisfied allocation rate=0.306781
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110
2016-10-04 21:45:39 epoch:1.87 train_step:18001 duration:0.947 elapsed:0.947 train_avg_metrics:['loss:0.527']  ['loss:0.527']
2016-10-04 21:45:39 epoch:1.87 eval_step: 18001 duration:0.001 elapsed:0.948 ratio:0.001
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
(the same warning is repeated several more times)
Traceback (most recent call last):
  File "./train.py", line 308, in <module>
    tf.app.run()

Answer 1:

Usually NaN is a sign of model instability, for example exploding gradients. It may go unnoticed otherwise: the loss simply stops shrinking. Logging weight summaries is what makes the problem explicit. I suggest reducing the learning rate as a first measure. If that doesn't help, post your code here; without seeing it, it's hard to suggest anything more specific.
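For illustration, here is a minimal sketch in the TF 1.x API (matching the log above) of a smaller learning rate together with gradient clipping, a common extra guard against exploding gradients. The toy variable, loss, and numbers are placeholders, not the asker's model:

import tensorflow as tf

# Toy variable and loss standing in for the real model (hypothetical, not the asker's graph).
w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

learning_rate = 0.001  # first measure: try an order of magnitude lower than the current value
optimizer = tf.train.GradientDescentOptimizer(learning_rate)

# Related guard against exploding gradients: clip by global norm before applying.
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)))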



Answer 2:

It sometimes happens during the initial iterations of training that the model predicts only a single class. If, by random chance, the predicted probability for the true class comes out as 0 for some training examples, the categorical cross-entropy loss involves log(0) and can become NaN.

Make sure you add a small value when computing the loss, e.g. tf.log(predictions + 1e-8). This helps overcome the numerical instability.
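A minimal sketch of that epsilon trick in the TF 1.x API; labels, predictions, and num_classes are hypothetical placeholders, not from the asker's code:

import tensorflow as tf

num_classes = 10  # hypothetical number of classes
labels = tf.placeholder(tf.float32, shape=[None, num_classes])       # one-hot targets
predictions = tf.placeholder(tf.float32, shape=[None, num_classes])  # softmax outputs

# The epsilon keeps tf.log away from log(0) = -inf, which would propagate as NaN.
epsilon = 1e-8
cross_entropy = -tf.reduce_sum(labels * tf.log(predictions + epsilon), axis=1)
loss = tf.reduce_mean(cross_entropy)

Computing the loss from raw logits with tf.nn.softmax_cross_entropy_with_logits is another common way to avoid this instability, since it never takes the log of an explicit probability.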


