Question
After much reading and diagramming, I think I've come up with a model that I can use as the foundation for more testing on which parameters and features I need to tweak. However, I am confused about how to implement the following test case (all numbers are orders of magnitude smaller than the final model, but I want to start small):
- Input data: 5000x1 time series vector, split into 5 epochs of 1000x1
- For each time step, 3 epochs' worth of data will be put through 3 time-distributed copies of a bidirectional LSTM layer; each of those will output a 10x1 vector (10 features extracted), and those outputs will then be taken as the input for a second bidirectional LSTM layer.
- For each time step, the first and last labels are ignored, but the center one is what is desired.
Here's what I've come up with, which does compile. However, looking at model.summary(), I think I'm missing the fact that I want the first LSTM to be run on 3 of the input sequences for each output time step. What am I doing wrong?
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, TimeDistributed, Dense, Flatten

model = Sequential()
model.add(TimeDistributed(
    Bidirectional(LSTM(11, return_sequences=True, recurrent_dropout=0.1, unit_forget_bias=True),
                  input_shape=(3, 3, epoch_len), merge_mode='sum'),
    input_shape=(n_epochs, 3, epoch_len)))
model.add(TimeDistributed(Dense(7)))
model.add(TimeDistributed(Flatten()))
model.add(Bidirectional(LSTM(12, return_sequences=True, recurrent_dropout=0.1, unit_forget_bias=True),
                        merge_mode='sum'))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Answer 1:
Since your question is a bit confusing, I'll make the following assumptions.
- You have one time series of 5000 time steps, each step with one feature. Shape: (1, 5000, 1).
- The main part of the answer to your question: you want to run a "sliding window" case, with the window size equal to 3000 and the window stride equal to 1000 (the resulting number of windows is worked out in the short sketch after this list).
- You want each window to be divided into 3 internal time series, each with 1000 steps and one feature per step. Each of these series enters the same LSTM as an independent series (which is equivalent to having 3 copies of the LSTM). Shape: (slidingWindowSteps, 3, 1000, 1).
- Important: from these 3 series you want 3 outputs with no length dimension and 10 features each. Shape: (1, 3, 10). (Your image says 1x10, but your text says 10x1; I'm assuming the image is correct.)
- You want these 3 outputs to be merged into a single sequence of 3 steps, shape (1, 3, 10).
- You want the LSTM that processes this 3-step sequence to also return a 3-step sequence.
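To make these numbers concrete, here is a small arithmetic sketch (plain Python, no Keras required; the variable names simply mirror the ones used in the code further down):

totalSteps = 5000      # length of the original series
windowSize = 3000      # 3 epochs of 1000 steps each
windowStride = 1000    # slide by one epoch

totalWindowSteps = ((totalSteps - windowSize) // windowStride) + 1
print(totalWindowSteps)   # -> 3 window positions, i.e. 3 samples of shape (3, 1000, 1)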
Preparing for the sliding window case:
In a sliding window case, it's unavoidable to duplicate data, so you first need to work on your input.
Taking the initial time series of shape (1, 5000, 1), we need to split it properly into a batch containing samples with 3 groups of 1000. Here I do this for X only; you will have to do a similar thing to Y.
import numpy as np

numberOfOriginalSequences = 1
totalSteps = 5000
features = 1

# example of an original input with 5000 steps
originalSeries = np.array(
    range(numberOfOriginalSequences * totalSteps * features)
).reshape((numberOfOriginalSequences, totalSteps, features))

windowSize = 3000
windowStride = 1000
totalWindowSteps = ((totalSteps - windowSize) // windowStride) + 1

# at first, let's keep these dimensions for better understanding
processedSequences = np.empty((numberOfOriginalSequences,
                               totalWindowSteps,
                               windowSize,
                               features))

for seq in range(numberOfOriginalSequences):
    for winStep in range(totalWindowSteps):
        start = winStep * windowStride
        end = start + windowSize
        processedSequences[seq, winStep, :, :] = originalSeries[seq, start:end, :]

# now we reshape the array to turn each window step into an independent sample:
totalSamples = numberOfOriginalSequences * totalWindowSteps
groupsInWindow = windowSize // windowStride
processedSequences = processedSequences.reshape((totalSamples,
                                                 groupsInWindow,
                                                 windowStride,
                                                 features))

print(originalSeries)
print(processedSequences)
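As a quick sanity check (my addition, not part of the original answer), the shapes produced by this preparation step can be verified directly; with the numbers above there should be 3 samples of 3 groups of 1000 steps each:

print(originalSeries.shape)        # (1, 5000, 1)
print(processedSequences.shape)    # (3, 3, 1000, 1)
assert processedSequences.shape == (totalSamples, groupsInWindow, windowStride, features)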
Creating the model:
A few comments about your first added layer:
- The model only takes one input_shape into account, and this shape is (groupsInWindow, windowStride, features). It should go on the most external wrapper: the TimeDistributed.
- You don't want to keep the 1000 time steps, you only want the 10 resulting features: return_sequences=False. (You can use several LSTMs in this first stage if you want more layers; in that case the earlier ones can keep the steps, only the last one needs return_sequences=False.)
- You want 10 features, so units=10.
I'll use the functional API just to see the input shape in the summary, which helps with understanding things.
from keras.models import Model
from keras.layers import Input, TimeDistributed, Bidirectional, LSTM, Dense, Lambda

intermediateFeatures = 10

inputTensor = Input((groupsInWindow, windowStride, features))

out = TimeDistributed(
    Bidirectional(
        LSTM(intermediateFeatures,
             return_sequences=False,
             recurrent_dropout=0.1,
             unit_forget_bias=True),
        merge_mode='sum'))(inputTensor)
At this point, you have eliminated the 1000 time steps. Since we used return_sequences=False, there is no need to flatten or anything like that. The data is already shaped as (samples, groupsInWindow, intermediateFeatures). The Dense layer is also not necessary, but it wouldn't be "wrong" to do it the way you did, as long as the final shape is the same.
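If you do want an intermediate Dense layer as in your original Sequential model, a minimal optional sketch would be a single extra line here (the 7 units are just the arbitrary value from your own code; it is shown commented out so the rest of the answer is unaffected):

# optional: a feature-mixing Dense after the first (time-distributed) LSTM;
# on 3D data a Dense acts on the last axis only, so the shape would
# become (samples, groupsInWindow, 7)
# out = TimeDistributed(Dense(7))(out)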
arbitraryLSTMUnits = 12
n_classes = 17

out = Bidirectional(
    LSTM(arbitraryLSTMUnits,
         return_sequences=True,
         recurrent_dropout=0.1,
         unit_forget_bias=True),
    merge_mode='sum')(out)

out = TimeDistributed(Dense(n_classes, activation='softmax'))(out)
And if you're going to discard the borders, you can add this layer:
out = Lambda(lambda x: x[:, 1, :])(out)  # keeps only the center step; Sequential equivalent: model.add(Lambda(lambda x: x[:,1,:]))
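The labels then need the matching treatment mentioned earlier ("you will have to do a similar thing to Y"): with the Lambda above, each window sample needs exactly one one-hot label, the one belonging to its center epoch. A rough sketch, assuming one class id per 1000-step epoch (epochClassIds and centerIds are hypothetical names, not part of the original code):

from keras.utils import to_categorical

epochClassIds = np.array([0, 3, 1, 4, 2])   # hypothetical: one class id per 1000-step epoch

# window winStep covers epochs winStep..winStep+2, so its center epoch is winStep+1
centerIds = np.array([epochClassIds[winStep + 1]
                      for winStep in range(totalWindowSteps)])

y = to_categorical(centerIds, num_classes=n_classes)   # shape (totalSamples, n_classes)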
Completing the model:
model = Model(inputTensor,out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
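With the prepared input from the sliding-window step and a label array like the y sketched above (and assuming the Lambda layer is included, so the output is 2D), training is then a standard fit call; the epoch count and batch size here are arbitrary placeholders:

model.fit(processedSequences, y, epochs=10, batch_size=1)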
Here is how the dimensions flow through this model. The first dimension I put here (totalSamples) is shown as None in the model.summary().
- Input: (totalSamples, groupsInWindow, windowStride, features)
- The time-distributed LSTM works like this:
  - TimeDistributed allows a 4th dimension, which is groupsInWindow. This dimension is kept.
  - The LSTM with return_sequences=False eliminates windowStride and changes the features (windowStride, the second-to-last dimension, sits at the time-steps position for this LSTM). Result: (totalSamples, groupsInWindow, intermediateFeatures)
- The other LSTM, without TimeDistributed, does not have the 4th dimension. This way, groupsInWindow (the second-to-last) becomes the "time steps". But return_sequences=True does not eliminate the time steps as the first LSTM did. Result: (totalSamples, groupsInWindow, arbitraryLSTMUnits)
- The final Dense layer, because it receives a 3D input, interprets the second dimension as if it were time-distributed and leaves it unchanged, applying itself only to the features dimension. Result: (totalSamples, groupsInWindow, n_classes)
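As an optional check (my addition, not part of the original answer), these shapes can be confirmed by predicting on random data of the prepared input shape:

# quick shape check on dummy data matching the prepared input
dummy = np.random.rand(totalSamples, groupsInWindow, windowStride, features)
print(model.predict(dummy).shape)   # (3, 17) with the Lambda layer; (3, 3, 17) without it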
Source: https://stackoverflow.com/questions/46859712/confused-about-how-to-implement-time-distributed-lstm-lstm