I was running TensorFlow and I happened to have something yielding a NaN. I'd like to know what it is but I do not know how to do this. The main issue is that in a "normal"
For TensorFlow 2, inject statements like x = tf.debugging.check_numerics(x, 'x is nan') into your code. They will raise an InvalidArgumentError if x has any values that are not a number (NaN) or infinite (Inf).
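A minimal sketch of how such a check might be wired into a forward pass; the model, batch and message text here are hypothetical, not from the original code:

import tensorflow as tf

def forward(model, batch):
    # check_numerics returns the tensor unchanged when it is clean and raises
    # InvalidArgumentError with the given message if it contains NaN or Inf
    logits = model(batch, training=True)
    logits = tf.debugging.check_numerics(logits, 'logits contain NaN/Inf')
    return logits

TF 2 also offers tf.debugging.enable_check_numerics(), which instruments every op globally at the cost of some speed.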
Oh, and for the next person finding this while hunting a TF2 NaN issue: my case turned out to be an exploding gradient. The gradient itself reached 1e+20, which is not quite NaN yet, but adding it to the variable then became too large. The diagnostic code I used was
gradients = tape.gradient(loss, training_variables)
# print the largest gradient entry per variable to spot blow-ups
for g, v in zip(gradients, training_variables):
    tf.print(v.name, tf.reduce_max(g))
optimizer.apply_gradients(zip(gradients, training_variables))
which revealed the overly large numbers. Running the exact same network on the CPU worked fine, but it failed on the GTX 1080 Ti in my workstation, which makes a CUDA numerical-stability issue the likely root cause. But since it only occurred sometimes, I duct-taped the whole thing by going with:
gradients = tape.gradient(loss, training_variables)
# clip each gradient tensor to a maximum L2 norm of 10 before applying it
gradients = [tf.clip_by_norm(g, 10.0) for g in gradients]
optimizer.apply_gradients(zip(gradients, training_variables))
which just clips exploding gradients to a sane value. For a network where gradients are always high that wouldn't help, but since the magnitudes were high only sporadically, this fixed the problem and the network now also trains nicely on the GPU.
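For what it's worth, a hedged variant of the same duct tape is to clip by the global norm across all gradients instead of per tensor; the threshold of 10.0 below is just as arbitrary as above:

gradients = tape.gradient(loss, training_variables)
# rescale all gradients jointly so their combined L2 norm is at most 10
gradients, _ = tf.clip_by_global_norm(gradients, 10.0)
optimizer.apply_gradients(zip(gradients, training_variables))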
I was able to fix my NaN issues by getting rid of all of my dropout layers in the network model. I suspected that maybe for some reason a unit (neuron?) in the network lost too many input connections (so it had zero after the dropout), so then when information was fed through, it had a value of NaN. I don't see how that could happen over and over again with dropout=0.8 on layers with more than a hundred units each, so the problem was probably fixed for a different reason. Either way, commenting out the dropout layers fixed my issue.
EDIT: Oops! I realized that I had added a dropout layer after my final output layer, which consists of three units. Now that makes more sense. So, don't do that!
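A hedged Keras sketch of the pitfall described above; the layer sizes and the 0.8 rate are illustrative, not taken from the original model:

import tensorflow as tf
from tensorflow.keras import layers

# Problematic: dropout applied after the small final output layer,
# which can zero out entire outputs during training
bad_model = tf.keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dense(3, activation='softmax'),
    layers.Dropout(0.8),  # don't do this
])

# More conventional: dropout only between hidden layers
good_model = tf.keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.8),
    layers.Dense(3, activation='softmax'),
])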
NaNs occurring in the forward pass are one thing and those occurring in the backward pass are another.

1. Make sure that there are no extreme inputs, such as NaN inputs or negative labels, in the prepared dataset, using NumPy tools, for instance: assert not np.any(np.isnan(x)).
2. Switch to a CPU environment to get a more detailed traceback, and test the forward pass only by putting loss = tf.stop_gradient(loss) before calculating the gradients, to see if you can run several batches with no errors. If an error occurs, there are several types of potential bugs and methods:
   - Insert tensor = tf.check_numerics(tensor, 'tensor') in some suspicious places.
   - Use tf_debug as written in this answer.
3. If everything goes well, remove the loss = tf.stop_gradient(loss). (See the sketch after this list.)
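A minimal sketch of that forward-only test, assuming a generic Keras-style model, dataset and loss_fn (all placeholder names); it uses the TF2 spelling tf.debugging.check_numerics:

import numpy as np
import tensorflow as tf

for x, y in dataset:
    # step 1: sanity-check the raw inputs with NumPy before they enter the graph
    assert not np.any(np.isnan(x.numpy()))

    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # flag NaN/Inf as soon as they appear in the forward pass
        logits = tf.debugging.check_numerics(logits, 'logits')
        loss = loss_fn(y, logits)
        # step 2: cut the backward pass so only the forward pass is exercised
        loss = tf.stop_gradient(loss)

    # while stop_gradient is in place these gradients come back as None;
    # step 3: once several batches run cleanly, delete the stop_gradient line
    gradients = tape.gradient(loss, model.trainable_variables)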
As an aside, it is always helpful to make sure that the shape of every tensor is as desired. You can try to feed fixed-size batches (dropping the remainder), reshape the feature tensors (where the graph receives data from the Dataset) to what you expect them to be (otherwise the first dimension is sometimes None), and then print the shape of that tensor in the graph with fixed numbers.
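A hedged sketch of that shape pinning; the batch size, feature size and synthetic dataset below are purely illustrative:

import tensorflow as tf

BATCH, FEATURES = 32, 10
dataset = tf.data.Dataset.from_tensor_slices(tf.random.normal([1000, FEATURES]))
dataset = dataset.batch(BATCH, drop_remainder=True)  # fixed-size batches only

for x in dataset:
    # pin the leading dimension instead of leaving it as None
    x = tf.reshape(x, [BATCH, FEATURES])
    tf.print(tf.shape(x))  # should always print [32 10]
    break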