Question
I was just looking at the Keras model details from a fairly straightforward sequential model where I have multiple LSTM layers, one after another. I was surprised to see that the first layer always has more params despite having the same definition as the subsequent LSTM layer.
The model definition here shows it clearly:
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 400, 5) 380
_________________________________________________________________
lstm_2 (LSTM) (None, 400, 5) 220
_________________________________________________________________
time_distributed_1 (TimeDist (None, 400, 650) 3900
_________________________________________________________________
lstm_3 (LSTM) (None, 400, 20) 53680
_________________________________________________________________
lstm_4 (LSTM) (None, 400, 20) 3280
_________________________________________________________________
Similarly, after a time-distributed dense layer, the same is true of the next two identical LSTMs.
Is my understanding of LSTMs incorrect, i.e. that an identical definition should produce an identical 'duplicate' layer tacked on the end, or is there something else in the param count that I need to understand? Currently it just looks weird to me!
Any explanation would be great to help me (a) understand better, and (b) build more performant models based on this new knowledge.
Answer 1:
The output size of an LSTM depends only on its units.
We can see that your first two layers both have 5 units, and the last two both have 20 units.
But the trainable parameters (the weights used to compute the expected output from the inputs) must also take into account how many input features are coming in, since every input feature participates in the calculations.
The bigger the input, the more parameters are necessary. From the counts we can tell that you have more than 5 features in the input, and that of the last two layers, the first receives 650 input features while the other receives only 20.
Detailed parameter count
In the LSTM layer, as you can see in its source code, there are 3 groups of weights:
- kernel, shaped as (inputs, 4*units)
- recurrent kernel, shaped as (units, 4*units)
- bias, shaped as (4*units,)
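As a quick sanity check, here is a tiny helper (my own illustration, not a Keras function) that totals those three groups of weights:

def lstm_params(inputs, units):
    # kernel (inputs, 4*units) + recurrent kernel (units, 4*units) + bias (4*units,)
    return 4 * (inputs * units + units * units + units)

print(lstm_params(5, 5))  # 220, matching lstm_2, whose 5 inputs are lstm_1's 5 units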
With some calculations, we can infer that your inputs have shape (None, 400, 13)
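Here is that calculation spelled out, using lstm_1's 380 parameters and its 5 units:

# 380 = 4*(inputs*5 + 5*5 + 5)  ->  95 = 5*inputs + 30  ->  inputs = 13
units, params = 5, 380
inputs = (params // 4 - units * units - units) // units
print(inputs)  # 13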
Layer (type) Output Shape Param #
========================================================================
input_6 (InputLayer) (None, 400, 13) 0
________________________________________________________________________
lstm_1 (LSTM) (None, 400, 5) 380 = 4*(13*5 + 5*5 + 5)
________________________________________________________________________
lstm_2 (LSTM) (None, 400, 5) 220 = 4*(5*5 + 5*5 + 5)
________________________________________________________________________
time_distributed_1 (None, 400, 650) 3900 = 5*650 + 650 (Dense, see below)
________________________________________________________________________
lstm_3 (LSTM) (None, 400, 20) 53680 = 4*(650*20 + 20*20 + 20)
________________________________________________________________________
lstm_4 (LSTM) (None, 400, 20) 3280 = 4*(20*20 + 20*20 + 20)
________________________________________________________________________
- LSTM 1 parameters = 4*(13*5 + 5*5 + 5)
- LSTM 2 parameters = 4*(5*5 + 5*5 + 5)
- Time distributed = 5*650 + 650 (it wraps a Dense layer; see below)
- LSTM 3 parameters = 4*(650*20 + 20*20 + 20)
- LSTM 4 parameters = 4*(20*20 + 20*20 + 20)
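To double-check, here is a minimal reconstruction (my sketch, not the asker's actual code; I am assuming return_sequences=True throughout, which any model with these output shapes must use) that reproduces every count above:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
model.add(LSTM(5, return_sequences=True, input_shape=(400, 13)))  # 380 params
model.add(LSTM(5, return_sequences=True))                         # 220 params
model.add(TimeDistributed(Dense(650)))                            # 3900 params
model.add(LSTM(20, return_sequences=True))                        # 53680 params
model.add(LSTM(20, return_sequences=True))                        # 3280 params
model.summary()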
Other layers have similar behavior
If you test with Dense layers, you will see the same pattern:
Layer (type) Output Shape Param #
=========================================================
input_6 (InputLayer) (None, 13) 0
_________________________________________________________
dense_1 (Dense) (None, 5) 70 = 13*5 + 5
_________________________________________________________
dense_2 (Dense) (None, 5) 30 = 5*5 + 5
_________________________________________________________
dense_3 (Dense) (None, 650) 3900 = 5*650 + 650
_________________________________________________________
dense_4 (Dense) (None, 20) 13020 = 650*20 + 20
_________________________________________________________
dense_5 (Dense) (None, 20) 420 = 20*20 + 20
=========================================================
The difference is that the dense layers don't have a recurrent kernel, and their kernels are not multiplied by 4.
- Dense 1 parameters = 13*5 + 5
- Dense 2 parameters = 5*5 + 5
- Dense 3 parameters = 5*650 + 650
- Dense 4 parameters = 650*20 + 20
- Dense 5 parameters = 20*20 + 20
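The corresponding helper for Dense (again my own illustration) also confirms the time_distributed_1 count above, since that layer just wraps a Dense(650) applied to 5 input features:

def dense_params(inputs, units):
    # kernel (inputs, units) + bias (units,)
    return inputs * units + units

print(dense_params(13, 5))    # 70
print(dense_params(5, 650))   # 3900, the time_distributed_1 count
print(dense_params(650, 20))  # 13020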
Source: https://stackoverflow.com/questions/46584171/why-does-the-first-lstm-in-a-keras-model-have-more-params-than-the-subsequent-on