I am a beginner in deep learning. I know that in regular neural nets people use batch norm before the activation, and that it reduces the reliance on good weight initialization. I wonder whether batch norm can be used in RNNs as well.
The answer is Yes and No.
Why yes: the Layer Normalization paper (Ba et al., 2016) explicitly discusses the use of BN in RNNs.
Why no: to apply BN in an RNN, the mean/std of the layer output has to be computed and stored separately for each time step. Imagine you pad the sequence inputs so that all examples have the same length; if a sequence at prediction time is longer than every training sequence, then at some time steps there are no mean/std statistics accumulated during the SGD training procedure to normalize with.
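Here is a minimal NumPy sketch of that failure mode (the shapes and random data are made up purely for illustration): per-timestep statistics are estimated from a padded training batch, so a longer test sequence has time steps with no statistics at all.

```python
import numpy as np

# Toy padded training batch: 8 sequences, 10 timesteps, 4 features (illustrative shapes).
train = np.random.randn(8, 10, 4)

# Hypothetical per-timestep BN statistics gathered during training:
# one mean/std pair per timestep, estimated over the batch dimension.
means = train.mean(axis=0)  # shape (10, 4)
stds = train.std(axis=0)    # shape (10, 4)

# At inference a sequence of length 15 arrives: timesteps 10..14 have
# no stored statistics, so there is nothing to normalize them with.
test = np.random.randn(1, 15, 4)
print(means.shape[0], "timesteps of statistics vs", test.shape[1], "test timesteps")
```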
Meanwhile, at least in Keras, I believe the BN layer only normalizes in the "vertical" direction, i.e. the sequence output of the layer. The "horizontal" direction, i.e. the hidden state and cell state carried across time steps, is not normalized. Correct me if I am wrong here.
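As a sketch of what I mean (standard tf.keras layers; the layer sizes are arbitrary), the BatchNormalization layer sits on the LSTM's output sequence, while the hidden/cell states inside the recurrence are left untouched:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 16))               # (timesteps, features)
x = layers.LSTM(32, return_sequences=True)(inputs)   # hidden/cell states inside the recurrence are NOT normalized
x = layers.BatchNormalization()(x)                   # normalizes only the sequence output ("vertical" direction)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```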
In multi-layer RNNs, you may consider using layer normalization instead.
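For example, a minimal stacked-RNN sketch with tf.keras (again, sizes are arbitrary); LayerNormalization computes its statistics over the feature axis of each sample, so it does not depend on the batch or on how long the sequence is:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 16))
x = layers.LSTM(32, return_sequences=True)(inputs)
x = layers.LayerNormalization()(x)   # per-sample, per-timestep statistics: no running batch averages needed
x = layers.LSTM(32)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```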