How does one debug NaN values in TensorFlow?


I was running TensorFlow and I happened to have something yielding a NaN. I'd like to know what it is, but I do not know how to do this. The main issue is that in a "normal"

9 Answers

遥遥无期

2020-12-23 09:51
First of all, you need to check your input data properly. In most cases this is the cause, though of course not always.
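As a quick sanity check on the inputs, something like the following could be run before training. It is a minimal, framework-agnostic sketch: find_non_finite and the sample batch are hypothetical names, not part of TensorFlow, and it assumes the data can be iterated as rows of Python floats.

```python
import math

def find_non_finite(rows):
    """Return (row, column, value) for every NaN or infinite entry in rows."""
    bad = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((r, c, value))
    return bad

# Hypothetical input batch containing one NaN and one inf.
batch = [[0.5, 1.0], [float("nan"), 2.0], [3.0, float("inf")]]
for r, c, value in find_non_finite(batch):
    print("non-finite value %r at row %d, column %d" % (value, r, c))
```

If this reports anything, the problem is in the data pipeline rather than in the model.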

I usually use TensorBoard to see what is happening during training. You can log the values at each step with:
```
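Z = tf.pow(Z, 2.0)
# Pre-1.0 API names; in TensorFlow 1.x these became
# tf.summary.scalar and tf.summary.merge_all.
summary_z = tf.scalar_summary('z', Z)
# etc.
summary_merge = tf.merge_all_summaries()
# Inside the training loop, at each step i you want to record:
summary_str = sess.run(summary_merge)
summary_writer.add_summary(summary_str, i)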
```
You can also simply evaluate and print the current value:
```
print(sess.run(Z))
```

忘掉有多难

2020-12-23 09:56

It looks like you can call tf.add_check_numerics_ops() once you have finished building the graph.

check = tf.add_check_numerics_ops()

I think this adds a numeric check for every floating-point tensor in the graph. You can then include the check op in the list of fetches passed to the session's run function.

sess.run([check, ...])
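
Put together, the two fragments above might look like the sketch below. It assumes the TensorFlow 1.x graph API (imported here through tensorflow.compat.v1 so that it can also run on 2.x). The placeholder x and the log op are made up for illustration, chosen because log(0) yields -inf.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # add_check_numerics_ops needs graph mode

# Hypothetical graph: log(0) produces -inf, which the check should catch.
x = tf.placeholder(tf.float32, shape=[None], name="x")
y = tf.math.log(x, name="log_x")

# Call this only after the whole graph has been built.
check = tf.add_check_numerics_ops()

caught = None
with tf.Session() as sess:
    try:
        sess.run([check, y], feed_dict={x: [1.0, 0.0]})
    except tf.errors.InvalidArgumentError as err:
        caught = err
        print("check_numerics failed:", err.message)
```

The error message names the offending tensor, which is usually enough to locate the op in your own code.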

时光取名叫无心

2020-12-23 09:58

As of version 0.12, TensorFlow ships with a built-in debugger called tfdbg. It streamlines the workflow of debugging this kind of bad-numerical-value issue (such as inf and nan). The documentation is at: https://www.tensorflow.org/programmers_guide/debugger

醉话见心

2020-12-23 09:59

The current implementation of tfdbg.has_inf_or_nan does not seem to break immediately on hitting a tensor that contains NaN. When it does stop, the huge list of tensors it displays is not sorted in order of execution. A possible hack to find the first appearance of NaNs is to dump all tensors to a temporary directory and inspect them afterwards (this assumes the NaNs appear within the first few runs).
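
The dump-and-inspect idea can be sketched without tfdbg. The example below is a plain-Python illustration, not the tfdbg API: it records every intermediate value in execution order and then reports the first one that contains NaN or inf. The step names and the first_non_finite helper are hypothetical.

```python
import math

def first_non_finite(dump):
    """Return the name of the first step whose values contain NaN or inf."""
    for name, values in dump:
        if any(not math.isfinite(v) for v in values):
            return name
    return None

# Hypothetical forward pass; each intermediate result is appended in
# execution order, playing the role of the dump directory.
dump = []
x = [4.0, 0.0]
dump.append(("input", x))
inv = [1.0 / v if v != 0.0 else float("inf") for v in x]  # 1/0 -> inf
dump.append(("reciprocal", inv))
diff = [v - v for v in inv]  # inf - inf -> nan
dump.append(("difference", diff))

print(first_non_finite(dump))  # reciprocal
```

Scanning in execution order matters here: the NaN only shows up in the difference step, but the root cause is the inf produced one step earlier.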

既然无缘

2020-12-23 10:02
There are several reasons why you can get a NaN result. Often the learning rate is too high, but plenty of other causes are possible, for example corrupt data in your input queue or taking the log of 0.
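
To illustrate the log-of-zero case, here is a small framework-agnostic sketch in plain Python. The function names and the eps value are assumptions chosen for illustration, and clamping is one common workaround rather than the only one.

```python
import math

def naive_cross_entropy(label, prob):
    # Python's math.log(0.0) raises an error, whereas TensorFlow's log
    # returns -inf, so that floating-point behaviour is mimicked here.
    log_p = math.log(prob) if prob > 0.0 else float("-inf")
    return -label * log_p  # 0 * -inf is NaN

def safe_cross_entropy(label, prob, eps=1e-12):
    # Clamp the probability away from zero before taking the log.
    return -label * math.log(max(prob, eps))

print(naive_cross_entropy(0.0, 0.0))  # nan
print(safe_cross_entropy(0.0, 0.0))   # a finite value
```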

Anyhow, a plain Python print will not work for the kind of debugging you describe, because it only prints the tensor's metadata from the graph and not any actual values.

However, if you use tf.Print as an op when building the graph, the actual values are printed when the graph is executed (and watching these values is a good exercise for debugging and understanding the behavior of your net).

However, you are not using the print statement quite correctly. It is an op, so you need to pass it a tensor and then use the tensor it returns later in the executing graph. Otherwise the op is never executed and nothing is printed. Try this:
```
Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z, [Z], message="my Z-values:")  # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
```

梦如初夏

2020-12-23 10:03
I used to find it much tougher to pinpoint where the NaNs and infs occur than to fix the bug itself. As a complement to @scai's answer, I'd like to add some points here:

The debug module, which you can import with:
```
from tensorflow.python import debug as tf_debug
```
is much better than any print or assert.

You can enable the debugger by wrapping your session like this:
```
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
```
You will then be dropped into a command-line interface, where you enter run -f has_inf_or_nan and then lt -f has_inf_or_nan to find where the NaNs or infs are. The first one is the first place where the catastrophe occurs. From the variable name you can trace it back to its origin in your code.

Reference: https://developers.googleblog.com/2017/02/debug-tensorflow-models-with-tfdbg.html