LSTM 'recurrent_dropout' with 'relu' yields NaNs

Asked by 刺人心 on 2020-12-10 21:22

Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. This happens for stacked, shallow, and stateful models, and regardless of return_sequences.

1 Answer
  • 2020-12-10 22:03

    After studying the LSTM formulae more closely and digging into the source code, everything has become crystal clear - and if it isn't to you just from reading the question, then you have something to learn from this answer.

    Verdict: recurrent_dropout has nothing to do with it; a thing's being looped where none expect it.


    Actual culprit: the activation argument, here 'relu', is applied to the recurrent transformations - contrary to virtually every tutorial that shows it as the harmless 'tanh'.

    That is, activation is not only for the hidden-to-output transform (see the source code); it operates directly in computing both recurrent states, cell and hidden:

    # from the Keras LSTMCell source: self.activation enters the recurrence itself
    c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))  # new cell state
    h = o * self.activation(c)  # new hidden state, fed back at the next timestep
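
    To see why an unbounded activation inside the recurrence blows up, here is a minimal toy sketch (plain NumPy, not the Keras cell; the weight scale, sizes, and step count are arbitrary assumptions):

        # Toy recurrence h_t = act(W_r @ h_{t-1} + x_t), contrasting bounded vs. unbounded activations
        import numpy as np

        rng = np.random.default_rng(0)
        W_r = rng.normal(scale=0.3, size=(32, 32))   # toy recurrent weights with gain > 1
        x = rng.normal(size=(100, 32))               # 100 timesteps of toy input

        for name, act in [("tanh", np.tanh), ("relu", lambda z: np.maximum(z, 0.0))]:
            h = np.zeros(32)
            for t in range(100):
                h = act(W_r @ h + x[t])
            print(name, "max |h| after 100 steps:", np.abs(h).max())
        # tanh stays within [-1, 1]; relu typically grows by many orders of magnitude,
        # which overflows to inf/NaN in a real float32 model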
    


    Solution(s) (a combined sketch follows the list):

    • Apply BatchNormalization to LSTM's inputs, especially if previous layer's outputs are unbounded (ReLU, ELU, etc)
      • If previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before activations (use activation=None, then BN, then Activation layer)
    • Use activation='selu'; more stable, but can still diverge
    • Use a lower learning rate
    • Apply gradient clipping
    • Use fewer timesteps
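
    A minimal sketch combining several of the above in Keras (layer sizes, timesteps, feature count, dropout rate, and learning rate are placeholder assumptions, not prescriptions):

        from tensorflow.keras.layers import Input, Dense, BatchNormalization, LSTM
        from tensorflow.keras.models import Model
        from tensorflow.keras.optimizers import Adam

        ipt = Input(shape=(100, 16))              # (timesteps, features) - assumed shape
        x = Dense(32, activation='relu')(ipt)     # unbounded outputs from the previous layer ...
        x = BatchNormalization()(x)               # ... so normalize the LSTM's inputs
        x = LSTM(32, activation='selu',           # 'selu' in the recurrence: more stable than 'relu'
                 recurrent_dropout=0.2,
                 return_sequences=False)(x)
        out = Dense(1, activation='sigmoid')(x)

        model = Model(ipt, out)
        model.compile(Adam(learning_rate=1e-4, clipnorm=1.0),  # lower lr + gradient clipping
                      loss='binary_crossentropy')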

    More answers, to some remaining questions:

    • Why was recurrent_dropout suspected? An unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that'd otherwise offset multiplicative reinforcement.
    • Why do nonzero-mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs (see the sketch after this list).
    • Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.
    • Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, each LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLUs --> fireworks.
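
    A quick numeric illustration of the nonzero-mean point (plain NumPy; the 0.5 shift is an arbitrary example value):

        # A positive shift makes ReLU outputs larger on average, so recurrent
        # pre-activations start out bigger and compound faster over timesteps.
        import numpy as np

        rng = np.random.default_rng(1)
        z = rng.normal(size=100_000)
        for mean in (0.0, 0.5):
            print("input mean", mean, "-> mean ReLU output:",
                  np.maximum(z + mean, 0.0).mean())
        # roughly 0.40 for zero-mean inputs vs. roughly 0.70 for the shifted ones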

    UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling the hidden transformations during training and easing divergent behavior over many timesteps. Git Issue on this here.
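
    For intuition, a tiny sketch of the inverted-dropout scaling mentioned above (the rate and vector are arbitrary toy values, not the Keras implementation):

        import numpy as np

        rng = np.random.default_rng(2)
        rate = 0.5
        h = np.ones(8)                                 # toy hidden-state values
        mask = (rng.random(8) >= rate) / (1.0 - rate)  # inverted dropout: keep w.p. 1-rate, scale by 1/(1-rate)
        print(h * mask)                                # surviving entries become 2.0 instead of 1.0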
