Any non-zero `recurrent_dropout` yields NaN losses and weights; the latter are either 0 or NaN. It happens for stacked, shallow, and stateful models, with any `return_sequences` setting.
Studying the LSTM formulae more deeply and digging into the source code, everything's become crystal clear - and if it isn't to you just from reading the question, then you have something to learn from this answer.
Verdict: `recurrent_dropout` has nothing to do with it; a thing's being looped where none expect it.
Actual culprit: the `activation` argument, now `'relu'`, is applied to the recurrent transformations - contrary to virtually every tutorial showing it as the harmless `'tanh'`.
I.e., `activation` is not only for the hidden-to-output transform (see source code); it operates directly on the computation of both recurrent states, cell and hidden:
```python
# Keras LSTMCell step (excerpt): self.activation is applied inside the recurrence,
# both to the candidate cell update and to the hidden-state output.
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
```
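To see why this matters, here is a minimal sketch (not Keras code; the gate values, kernel scale, and dimensions are made up for illustration) that iterates the recurrence above with `relu` vs. `tanh` as the activation:

```python
import numpy as np

def simulate(activation, units=32, timesteps=200, seed=0):
    """Iterate c = f*c + i*act(x_c + h @ W_rec), h = o*act(c) with fixed gates."""
    rng = np.random.default_rng(seed)
    c = np.zeros(units)
    h = np.zeros(units)
    recurrent_kernel_c = rng.normal(size=(units, units))  # stand-in recurrent weights
    f = i = o = 0.7                                        # fixed gate values, for illustration
    for _ in range(timesteps):
        x_c = rng.normal(size=units)                       # stand-in input projection
        c = f * c + i * activation(x_c + h @ recurrent_kernel_c)
        h = o * activation(c)
    return np.max(np.abs(c))

print("tanh:", simulate(np.tanh))                     # bounded: |c| <= i/(1-f) ~ 2.3
print("relu:", simulate(lambda z: np.maximum(z, 0)))  # typically blows up to inf/nan
```

With `tanh`, the recurrent contribution is squashed at every step, so `c` stays bounded; with `relu` it isn't, so a large `h` feeds an even larger `c`, and over enough timesteps the values overflow.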
Solution(s), sketched in code after the list:

- Apply `BatchNormalization` to the LSTM's inputs, especially if the previous layer's outputs are unbounded (ReLU, ELU, etc.)
- Use `activation=None`, then BN, then an `Activation` layer (normalize before the nonlinearity)
- Use `activation='selu'`; more stable, but can still diverge
- Lower the `lr`
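A hedged sketch of how those fixes might look in Keras (layer sizes, timesteps, features, and learning rate are placeholders, not from the original model):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(100, 16))           # (timesteps, features) placeholder
x = layers.Dense(32, activation=None)(inp)      # unbounded pre-activations...
x = layers.BatchNormalization()(x)              # ...BN before the nonlinearity
x = layers.Activation('relu')(x)
x = layers.BatchNormalization()(x)              # normalize the LSTM's inputs as well
x = layers.LSTM(32, activation='selu')(x)       # 'selu': more stable than 'relu', can still diverge
out = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lower lr
              loss='binary_crossentropy')
model.summary()
```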
More answers to some remaining questions:
Why was `recurrent_dropout` suspected? An unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that'd otherwise offset multiplicative reinforcement.

UPDATE 1/22/2020: `recurrent_dropout` may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, easing divergent behavior over many timesteps. Git Issue on this here
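For the update, here is a small numpy sketch of what "inverted dropout upscaling" means (the dropout rate is hypothetical): surviving units are divided by (1 - rate) at train time so the expected value is preserved, which amplifies the hidden transformations that feed the next timestep.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.4                                   # hypothetical recurrent_dropout rate
h = rng.normal(size=1000)                    # stand-in hidden transformation

# Inverted dropout: zero a fraction `rate`, scale survivors by 1/(1 - rate).
mask = rng.random(1000) >= rate
h_train = np.where(mask, h / (1.0 - rate), 0.0)

print(np.abs(h).max(), np.abs(h_train).max())  # surviving values are ~1.67x larger
```

Each surviving hidden value being roughly 1/(1 - rate) times larger compounds across timesteps, which is consistent with the accelerated divergence noted above.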