Computing the derivative with respect to the input of a network with batch normalization: training vs. inference time

When a network contains a Batch Normalization layer, I get different results for the derivative of the network output with respect to its input depending on whether the network is in training mode or inference mode.
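To make the difference concrete, here is a minimal NumPy sketch of what I believe is happening. It is not my actual network: `batchnorm` and `jacobian_fd` are hypothetical helpers written for illustration, the layer has a single feature, and the running statistics (`running_mean = 0`, `running_var = 1`) are assumed values. In training mode the layer normalizes with the statistics of the current batch, so each output depends on every sample in the batch; in inference mode it uses fixed running statistics, so each output depends only on its own input.

```python
import numpy as np

def batchnorm(x, gamma, beta, running_mean, running_var, training, eps=1e-5):
    """Batch norm over a batch of scalars (hypothetical helper, one feature)."""
    if training:
        # Training mode: statistics come from the current batch,
        # so they are themselves functions of x.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Inference mode: statistics are constants (assumed values here).
        mean, var = running_mean, running_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def jacobian_fd(f, x, h=1e-6):
    """Central finite-difference Jacobian, J[i, j] = d f(x)[i] / d x[j]."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([0.5, -1.2, 2.0, 0.3])      # a batch of 4 samples
gamma, beta = 1.5, 0.1                    # assumed scale and shift
running_mean, running_var = 0.0, 1.0      # assumed running statistics

J_train = jacobian_fd(
    lambda v: batchnorm(v, gamma, beta, running_mean, running_var, True), x)
J_eval = jacobian_fd(
    lambda v: batchnorm(v, gamma, beta, running_mean, running_var, False), x)

print(np.round(J_train, 3))  # dense: samples are coupled through mean and variance
print(np.round(J_eval, 3))   # diagonal: gamma / sqrt(running_var + eps)
```

Under these assumptions the inference-mode Jacobian is diagonal, while the training-mode Jacobian has nonzero off-diagonal entries and rows that sum to zero, because shifting the whole batch by a constant leaves the normalized output unchanged. So the two modes would be expected to give different input derivatives even with identical weights.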