I was running TensorFlow and I happened to have something yielding a NaN. I'd like to know what it is but I do not know how to do this. The main issue is that in a "normal"
For TensorFlow 2, inject statements like x = tf.debugging.check_numerics(x, 'x is nan') into your code. They will raise an InvalidArgumentError if x has any values that are not a number (NaN) or infinite (Inf).
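A minimal sketch of how such a check might be wired into a forward pass; the model, batch and message text here are hypothetical, not from the original code:

import tensorflow as tf

def forward(model, batch):
    # check_numerics returns the tensor unchanged when it is clean and raises
    # InvalidArgumentError with the given message if it contains NaN or Inf
    logits = model(batch, training=True)
    logits = tf.debugging.check_numerics(logits, 'logits contain NaN/Inf')
    return logits

TF 2 also offers tf.debugging.enable_check_numerics(), which instruments every op globally at the cost of some speed.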
Oh, and for the next person finding this while hunting a TF2 NaN issue: my case turned out to be an exploding gradient. The gradient itself reached 1e+20, which is not quite NaN yet, but adding it to the variable then became too large. The diagnostic code I used was
gradients = tape.gradient(loss, training_variables)
# print the largest gradient entry per variable to spot blow-ups
for g, v in zip(gradients, training_variables):
    tf.print(v.name, tf.reduce_max(g))
optimizer.apply_gradients(zip(gradients, training_variables))
which revealed the overly large numbers. Running the exact same network on the CPU worked fine, but it failed on the GTX 1080 Ti in my workstation, which makes a CUDA numerical-stability issue the likely root cause. But since it only occurred sometimes, I duct-taped the whole thing by going with:
gradients = tape.gradient(loss, training_variables)
# clip each gradient tensor to a maximum L2 norm of 10 before applying it
gradients = [tf.clip_by_norm(g, 10.0) for g in gradients]
optimizer.apply_gradients(zip(gradients, training_variables))
which just clips exploding gradients to a sane value. For a network where gradients are always high that wouldn't help, but since the magnitudes were high only sporadically, this fixed the problem and the network now also trains nicely on the GPU.
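For what it's worth, a hedged variant of the same duct tape is to clip by the global norm across all gradients instead of per tensor; the threshold of 10.0 below is just as arbitrary as above:

gradients = tape.gradient(loss, training_variables)
# rescale all gradients jointly so their combined L2 norm is at most 10
gradients, _ = tf.clip_by_global_norm(gradients, 10.0)
optimizer.apply_gradients(zip(gradients, training_variables))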
I was able to fix my NaN issues by getting rid of all of my dropout layers in the network model. I suspected that maybe for some reason a unit (neuron?) in the network lost too many input connections (so it had zero after the dropout), so then when information was fed through, it had a value of NaN. I don't see how that could happen over and over again with dropout=0.8 on layers with more than a hundred units each, so the problem was probably fixed for a different reason. Either way, commenting out the dropout layers fixed my issue.
EDIT: Oops! I realized that I had added a dropout layer after my final output layer, which consists of three units. Now that makes more sense. So, don't do that!
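A hedged Keras sketch of the pitfall described above; the layer sizes and the 0.8 rate are illustrative, not taken from the original model:

import tensorflow as tf
from tensorflow.keras import layers

# Problematic: dropout applied after the small final output layer,
# which can zero out entire outputs during training
bad_model = tf.keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dense(3, activation='softmax'),
    layers.Dropout(0.8),  # don't do this
])

# More conventional: dropout only between hidden layers
good_model = tf.keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.8),
    layers.Dense(3, activation='softmax'),
])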
NaNs occurring in the forward pass are one thing and those occurring in the backward pass are another.

1. Make sure that there are no extreme inputs, such as NaN inputs or negative labels, in the prepared dataset, using NumPy tools, for instance: assert not np.any(np.isnan(x)).
2. Switch to a CPU environment to get a more detailed traceback, and test the forward pass only by putting loss = tf.stop_gradient(loss) before calculating the gradients, to see if you can run several batches with no errors. If an error occurs, there are several types of potential bugs and methods:
   - Insert tensor = tf.check_numerics(tensor, 'tensor') in some suspicious places.
   - Use tf_debug as written in this answer.
3. If everything goes well, remove the loss = tf.stop_gradient(loss). (See the sketch after this list.)
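A minimal sketch of that forward-only test, assuming a generic Keras-style model, dataset and loss_fn (all placeholder names); it uses the TF2 spelling tf.debugging.check_numerics:

import numpy as np
import tensorflow as tf

for x, y in dataset:
    # step 1: sanity-check the raw inputs with NumPy before they enter the graph
    assert not np.any(np.isnan(x.numpy()))

    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # flag NaN/Inf as soon as they appear in the forward pass
        logits = tf.debugging.check_numerics(logits, 'logits')
        loss = loss_fn(y, logits)
        # step 2: cut the backward pass so only the forward pass is exercised
        loss = tf.stop_gradient(loss)

    # while stop_gradient is in place these gradients come back as None;
    # step 3: once several batches run cleanly, delete the stop_gradient line
    gradients = tape.gradient(loss, model.trainable_variables)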
As an aside, it is always helpful to make sure that the shape of every tensor is as desired. You can try to feed fixed-size batches (dropping the remainder), reshape the feature tensors (where the graph receives data from the Dataset) to what you expect them to be (otherwise the first dimension is sometimes None), and then print the shape of that tensor in the graph with fixed numbers.
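A hedged sketch of that shape pinning; the batch size, feature size and synthetic dataset below are purely illustrative:

import tensorflow as tf

BATCH, FEATURES = 32, 10
dataset = tf.data.Dataset.from_tensor_slices(tf.random.normal([1000, FEATURES]))
dataset = dataset.batch(BATCH, drop_remainder=True)  # fixed-size batches only

for x in dataset:
    # pin the leading dimension instead of leaving it as None
    x = tf.reshape(x, [BATCH, FEATURES])
    tf.print(tf.shape(x))  # should always print [32 10]
    break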