How does one debug NaN values in TensorFlow?

遥遥无期 2020-12-23 09:07

I was running TensorFlow and I happen to have something yielding a NaN. I'd like to know what it is but I do not know how to do this. The main issue is that in a "normal"

9 answers
  •  春和景丽
    2020-12-23 10:09

    NaNs that occur in the forward pass are one thing; those that occur in the backward pass are another.

    Step 0: data

    Make sure the prepared dataset contains no extreme inputs, such as NaN inputs or negative labels, by checking it with NumPy, for instance: assert not np.any(np.isnan(x)).
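    A minimal sketch of such a data check, extending the one-line assert above. The function name check_dataset and the num_classes parameter are hypothetical, and the upper-bound test assumes integer class labels in the range [0, num_classes), which the original does not state.

```python
import numpy as np

def check_dataset(x, y, num_classes):
    """Raise AssertionError if features or labels look unsafe for training."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y)
    # NaN and Inf in the features are checked separately for clearer messages.
    assert not np.any(np.isnan(x)), "NaN found in features"
    assert np.all(np.isfinite(x)), "Inf found in features"
    # Assumption: labels are integer class ids in [0, num_classes).
    assert np.all(y >= 0), "negative label found"
    assert np.all(y < num_classes), "label outside [0, num_classes)"

x_good = np.array([[0.1, 0.2], [0.3, 0.4]])
y_good = np.array([0, 1])
check_dataset(x_good, y_good, num_classes=2)  # passes silently
```

    Running this once over the whole dataset before training is cheap compared with hunting for a NaN inside the graph.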

    Step 1: the forward

    Switch to a CPU environment to get a more detailed traceback, and test the forward pass only by setting loss = tf.stop_gradient(loss) before calculating the gradients, to see whether you can run several batches with no errors. If an error occurs, there are several types of potential bugs and ways to find them:

    1. a 0 inside the log of a cross-entropy loss function (please refer to this answer)
    2. the 0/0 problem
    3. the out-of-class problem (a label outside the valid range of classes), as issued here
    4. try tensor = tf.check_numerics(tensor, 'tensor') in some suspicious places
    5. try tf_debug as written in this answer
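    To illustrate items 1 and 4 together, here is a small sketch that assumes TensorFlow 2.x in eager mode, where the check is spelled tf.debugging.check_numerics. The two loss functions are hypothetical, and clipping the probabilities is just one common mitigation, not necessarily what the linked answer recommends.

```python
import tensorflow as tf

def naive_cross_entropy(labels, probs):
    # A label of 0 times log(0) is 0 * -inf, which evaluates to NaN.
    return -tf.reduce_sum(labels * tf.math.log(probs))

def clipped_cross_entropy(labels, probs, eps=1e-7):
    # Assumed mitigation: keep probabilities away from exactly 0.
    probs = tf.clip_by_value(probs, eps, 1.0)
    return -tf.reduce_sum(labels * tf.math.log(probs))

labels = tf.constant([1.0, 0.0])
probs = tf.constant([1.0, 0.0])

caught = False
try:
    # check_numerics raises if the tensor contains NaN or Inf.
    tf.debugging.check_numerics(naive_cross_entropy(labels, probs), "naive loss")
except tf.errors.InvalidArgumentError:
    caught = True

# The clipped version passes the same check.
safe_loss = tf.debugging.check_numerics(
    clipped_cross_entropy(labels, probs), "clipped loss")
```

    Placing the check right after each suspicious op, rather than only on the final loss, narrows down which op first produced the bad value.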

    Step 2: the backward

    If everything goes well, remove the loss = tf.stop_gradient(loss) line.

    1. try a very small learning rate
    2. replace complex blocks of code with simple computations, such as a fully connected layer, that have the same input and output shapes, to narrow down where the bug lies. You may encounter backward bugs like this.
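    One way to narrow down a backward-only problem is to inspect the gradients per variable. The sketch below assumes TensorFlow 2.x with tf.GradientTape; the helper nonfinite_gradients is hypothetical. The example loss is a hand-written L2 norm, whose forward value at zero is finite while its gradient is not.

```python
import tensorflow as tf

def nonfinite_gradients(loss_fn, variables):
    """Return the loss and the names of variables with NaN/Inf gradients."""
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, variables)
    bad = []
    for var, grad in zip(variables, grads):
        if grad is None:  # variable does not influence the loss
            continue
        grad = tf.convert_to_tensor(grad)  # also densifies IndexedSlices
        if not bool(tf.reduce_all(tf.math.is_finite(grad))):
            bad.append(var.name)
    return loss, bad

w = tf.Variable([0.0, 0.0], name="w")
# Forward pass: sqrt(0) = 0, which is finite.
# Backward pass: the derivative of sqrt at 0 is unbounded.
loss, bad = nonfinite_gradients(
    lambda: tf.sqrt(tf.reduce_sum(tf.square(w))), [w])
print(float(loss), bad)
```

    Swapping one block at a time for a simple layer and re-running such a check shows which block makes the list of offending variables empty again.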

    As an aside, it always helps to make sure that every tensor has the shape you expect. You can feed fixed-size batches (drop the remainders) and reshape the feature tensors (where the graph receives data from the Dataset) to the shapes you expect (otherwise the first dimension would sometimes be None), and then print the shapes of those tensors in the graph, which will now show fixed numbers.
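    A short sketch of the fixed-size batch idea, assuming the data comes from a tf.data.Dataset in TensorFlow 2.x. The array sizes and variable names are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: 10 examples with 3 features each.
features = np.zeros((10, 3), dtype=np.float32)
ds = tf.data.Dataset.from_tensor_slices(features)

# Without drop_remainder the last batch may be smaller,
# so the static batch dimension is unknown (None).
loose = ds.batch(4)
# With drop_remainder=True every batch has exactly 4 rows.
fixed = ds.batch(4, drop_remainder=True)

print(loose.element_spec.shape)
print(fixed.element_spec.shape)

# Count the batches actually produced by the fixed-size pipeline.
num_fixed_batches = sum(1 for _ in fixed)
```

    With drop_remainder=True the two leftover examples are discarded, which is the price of a fully static shape.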
