If I want to use the BatchNormalization function in Keras, do I need to call it only once at the beginning?
I read this documentation for it: http://keras.io/la
This thread has seen considerable debate about whether BN should be applied before the non-linearity of the current layer or to the activations of the previous layer.
Although there is no single correct answer, the authors of Batch Normalization say that it should be applied immediately before the non-linearity of the current layer. The reason, quoted from the original paper:
"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu+b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyv¨arinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution."