Why is the gradient of the categorical crossentropy loss with respect to logits 0 with GradientTape in TF 2.0?

Submitted by 天涯浪子 on 2019-12-11 10:07:01

Question


I am learning TensorFlow 2.0 and I am trying to figure out how gradient tapes work. In the simple example below I evaluate the cross-entropy loss between logits and labels, and I am wondering why the gradient with respect to the logits is zero. (Please look at the code below.) The TF version is tensorflow-gpu==2.0.0-rc0.

import tensorflow as tf

logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]],dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.losses.categorical_crossentropy(labels, logits))

grads = tape.gradient(loss, logits)
print(grads)

I am getting

 tf.Tensor(
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]], shape=(3, 3), dtype=float32)

as a result. Shouldn't it tell me how much I should change the logits in order to minimize the loss?


Answer 1:


When calculating the cross-entropy loss, set from_logits=True in tf.losses.categorical_crossentropy(). By default it is False, which means you are computing the cross-entropy loss directly as -p*log(q). With from_logits=True, the loss is computed as -p*log(softmax(q)) instead.
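For example, here is a minimal sketch of the snippet from the question with from_logits=True (the import is added so it runs standalone):

import tensorflow as tf

logits = tf.Variable([[1, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=tf.float32)

with tf.GradientTape() as tape:
    # from_logits=True applies softmax inside the loss, so the gradient
    # w.r.t. each row of raw logits is softmax(logits_row) - labels_row.
    loss = tf.reduce_sum(
        tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=True))

grads = tape.gradient(loss, logits)
print(grads)   # no longer all zeros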

Update:

I just found an interesting result.

logits = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))

grads = tape.gradient(loss, logits)
print(grads)

The grads will be tf.Tensor([[-0.25 1. 1. ]], shape=(1, 3), dtype=float32)

Previously, I thought TensorFlow would compute the loss as loss = -\sum_i p_i \log(q_i), whose derivative with respect to q_i is -p_i/q_i. So the expected gradient would be [-1.25, 0, 0]. Instead, every component of the reported gradient is larger by 1, although this shift does not affect the optimization process.
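A quick hand-check of that expected (un-normalized) gradient in plain Python:

p = [1.0, 0.0, 0.0]
q = [0.8, 0.1, 0.1]
# d/dq_i of -sum_i p_i*log(q_i) is -p_i/q_i
print([-pi / qi for pi, qi in zip(p, q)])   # [-1.25, -0.0, -0.0]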

After reading the source code of tf.keras.losses.categorical_crossentropy, I found that even when from_logits=False it still normalizes the probabilities, so the loss actually computed is -\sum_i p_i \log(q_i / \sum_j q_j). That changes the gradient expression: differentiating with respect to q_i gives -p_i/q_i + (\sum_k p_k)/(\sum_j q_j). The second term comes from the \log(\sum_j q_j) part of the loss, which depends on every q_i, so it appears even for classes where p_i = 0. With \sum_k p_k = 1 and \sum_j q_j = 1, every component is shifted up by exactly 1, which is why the first gradient is -0.25 and the other two are 1.
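And a hand-check that this corrected expression reproduces the gradients TF reported for the [0.8, 0.1, 0.1] example:

p = [1.0, 0.0, 0.0]
q = [0.8, 0.1, 0.1]
shift = sum(p) / sum(q)   # the (sum_k p_k)/(sum_j q_j) term; here exactly 1
print([-pi / qi + shift for pi, qi in zip(p, q)])   # [-0.25, 1.0, 1.0]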

To confirm that every gradient is shifted by 1/\sum_j q_j, take an example where \sum_j q_j is not 1:

logits = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]],dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=False))

grads = tape.gradient(loss, logits)
print(grads)

The grads are tf.Tensor([[-0.57142866 1.4285713 1.4285713 ]]), whereas the un-normalized formula -p_i/q_i would give [-2, 0, 0].

This shows that every gradient is shifted by 1/(0.5+0.1+0.1) ≈ 1.4286. For the class with p_i = 1 the shift makes sense immediately from the expression above; for the classes with p_i = 0 it appears because the normalization term \log(\sum_j q_j) depends on every q_i, so its derivative 1/\sum_j q_j is added to every component, regardless of p_i.
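As a final sanity check, reproducing the normalization by hand (dividing q by its sum before applying the plain -p*log(q) formula, which is what the source reading above suggests Keras does internally) yields the same gradients:

import tensorflow as tf

q = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
p = tf.constant([[1.0, 0.0, 0.0]], dtype=tf.float32)

with tf.GradientTape() as tape:
    # Normalize q to sum to 1, as Keras does when from_logits=False,
    # then apply the plain cross-entropy formula.
    q_norm = q / tf.reduce_sum(q, axis=-1, keepdims=True)
    loss = -tf.reduce_sum(p * tf.math.log(q_norm))

print(tape.gradient(loss, q))
# expected: approximately [[-0.5714  1.4286  1.4286]], matching the grads above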



Source: https://stackoverflow.com/questions/57892492/why-the-gradient-of-categorical-crossentropy-loss-with-respect-to-logits-is-0-wi
