I am a beginner in deep learning. I know that in regular neural nets people use batch norm before the activation, and that it reduces the reliance on good weight initialization. I wonder whether it would do the same for an RNN/LSTM when I use it there.
In any non-recurrent network (convnet or not), when you apply BN each layer gets to adjust the incoming scale and mean, so the distribution of each layer's inputs doesn't keep changing (which is what the authors of the BN paper claim is the advantage of BN).
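To make that concrete, here is a minimal sketch of that setup, assuming PyTorch (the framework isn't specified in the question), with BN placed before the nonlinearity in an ordinary feedforward block:

```python
import torch
import torch.nn as nn

# Each layer normalizes its incoming activations (BN before the
# nonlinearity), so the next layer sees a roughly fixed distribution
# regardless of how the earlier weights shift during training.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),   # learns a per-feature scale (gamma) and shift (beta)
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 128)   # batch of 32 examples
out = model(x)             # shape: (32, 10)
```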
The problem with doing this for the recurrent outputs of an RNN is that the parameters governing the incoming distribution are now shared between all time steps (which are effectively layers in backpropagation through time, or BPTT). So the distribution ends up being fixed across the temporal layers of BPTT. This may not make sense, because there may be structure in the data that varies (in a non-random way) across the time series. For example, if the time series is a sentence, certain words are much more likely to appear at the beginning or the end. So forcing the distribution to be fixed might reduce the effectiveness of BN; a sketch of this naive setup is below.
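Here is a hypothetical illustration of that sharing problem (again assuming PyTorch; `NaiveBNRNN` is just a made-up name, not an existing module). A single BN module, with one set of learned scale/shift parameters and one set of running statistics, gets reused at every time step of the unrolled recurrence:

```python
import torch
import torch.nn as nn

class NaiveBNRNN(nn.Module):
    def __init__(self, input_size=64, hidden_size=128):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.bn = nn.BatchNorm1d(hidden_size)  # ONE module shared by all time steps

    def forward(self, x):  # x: (seq_len, batch, input_size)
        h = x.new_zeros(x.size(1), self.cell.hidden_size)
        outputs = []
        for t in range(x.size(0)):
            h = self.cell(x[t], h)
            h = self.bn(h)  # same gamma/beta and running stats at t=0 and t=T-1
            outputs.append(h)
        return torch.stack(outputs)

seq = torch.randn(20, 32, 64)    # 20 time steps, batch of 32
out = NaiveBNRNN()(seq)          # shape: (20, 32, 128)
```

Because `self.bn` is applied identically at every `t`, the hidden state at the first and last time steps is normalized toward the same distribution, even if (as in the sentence example) those positions behave quite differently.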