Question
I want to predict the next frame of a (greyscale) video given N previous frames, using CNNs or RNNs in Keras. Most tutorials and other information regarding time series prediction and Keras use a 1-dimensional input in their network, but mine would be 3D (N frames x rows x cols).
I'm currently really unsure what a good approach for this problem would be. My ideas include:
Using one or more LSTM layers. The problem here is that I'm not sure whether they're suited to take a series of images instead of a series of scalars as input. Wouldn't the memory consumption explode? If it is okay to use them: how can I use them in Keras for higher dimensions?
Using 3D convolution on the input (the stack of previous video frames). This raises other questions: Why would this help when I'm not doing a classification but a prediction? How can I stack the layers in such a way that the input of the network has dimensions (N x cols x rows) and the output (1 x cols x rows)?
I'm pretty new to CNNs/RNNs and Keras and would appreciate any hint in the right direction.
Answer 1:
So basically every approach has its advantages and disadvantages. Let's go through the ones you provided and then some others to find the best approach:
LSTM: Among their biggest advantages is the ability to learn long-term dependency patterns in your data. They were designed to analyse long sequences such as speech or text. This might also cause problems, because the number of parameters can be really high. Other typical recurrent architectures like GRU might overcome this issue. The main disadvantage is that in their standard (sequential) implementation it's infeasible to fit them on video data, for the same reason that dense layers are bad for image data: loads of temporal and spatial invariances must be learnt by a topology that is not at all suited to capturing them efficiently. Shifting a video by one pixel to the right might completely change the output of your network. Another thing worth mentioning is that training an LSTM is believed to be similar to finding an equilibrium between two rival processes: finding good weights for the dense-like output computations and finding good inner-memory dynamics for processing sequences. Finding this equilibrium might take a really long time, but once it's found it's usually quite stable and produces really good results.
Conv3D: Among their biggest advantages is the ability to capture spatial and temporal invariances in the same manner as Conv2D does for images. This makes the curse of dimensionality much less harmful. On the other hand, in the same way that Conv1D might not produce good results with longer sequences, the lack of any memory might make learning a long sequence harder.
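To put rough numbers on the parameter-count concern, here is a small sketch comparing a standard LSTM fed with flattened frames against a single 3D convolution. The frame size (64 x 64), the 256 LSTM units and the 32 filters are assumed values chosen purely for illustration, not figures from the answer. The formulas are the usual parameter counts for these layer types (an LSTM has four gates, each with input weights, recurrent weights and a bias).

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer (4 gates, each with
    input weights, recurrent weights and a bias vector)."""
    return 4 * (units * (input_dim + units) + units)


def conv3d_param_count(kernel, in_channels, filters):
    """Trainable parameters of a 3D convolution with one bias per filter."""
    kd, kh, kw = kernel
    return kd * kh * kw * in_channels * filters + filters


# Assumed sizes, for illustration only.
rows, cols = 64, 64
lstm_params = lstm_param_count(input_dim=rows * cols, units=256)
conv_params = conv3d_param_count(kernel=(3, 3, 3), in_channels=1, filters=32)

print("LSTM on flattened frames:", lstm_params)
print("Conv3D with 3x3x3 kernels:", conv_params)
```

Note that the convolution's parameter count does not depend on the frame size at all, while the LSTM's input weights grow linearly with the number of pixels per frame.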
Of course one may use different approaches like:
TimeDistributed + Conv2D: using a TimeDistributed wrapper, one may apply a pretrained convnet such as Inception framewise and then analyse the feature maps sequentially. A really huge advantage of this approach is the possibility of transfer learning. As a disadvantage, one may think of it as a Conv2.5D: it lacks temporal analysis of your data.
ConvLSTM: this architecture is not yet supported by the newest version of Keras (as of March 6th 2017), but as one may see here it should be provided in the future. This is a mixture of LSTM and Conv2D, and it's believed to be better than stacking Conv2D and LSTM.
Of course these are not the only ways to solve this problem. I'll mention one more which might be useful:
- Stacking: one may easily stack the methods above to build a final solution. E.g. one may build a network where the video is first transformed using a TimeDistributed(ResNet), then the output is fed to a Conv3D with multiple, aggressive spatial pooling layers, and finally transformed by a GRU/LSTM layer.
PS:
One more thing worth mentioning is that the shape of video data is actually 4D, with (frames, width, height, channels).
PS2:
In case your data is actually 3D, with (frames, width, height), you could use a classic Conv2D (by treating the frames as channels) to analyse this data, which might actually be more computationally efficient. In the case of transfer learning you should add an additional dimension, because most CNN models were trained on data with shape (width, height, 3). One may notice that your data doesn't have 3 channels. In this case a commonly used technique is to repeat the spatial matrix three times.
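The two reshaping tricks from PS2 can be sketched with NumPy. The array sizes below are made up for illustration, and `video` is a hypothetical greyscale clip stored as (frames, width, height).

```python
import numpy as np

# Hypothetical greyscale clip: 10 frames of 32 x 24 pixels.
n_frames, width, height = 10, 32, 24
video = np.random.rand(n_frames, width, height).astype("float32")

# Trick 1: treat the frames as channels, so a plain Conv2D can consume the
# clip as one (width, height, channels) image with channels == n_frames.
frames_as_channels = np.transpose(video, (1, 2, 0))

# Trick 2: for a pretrained CNN that expects 3 colour channels, add a
# channel axis and repeat each greyscale frame three times.
video_rgb_like = np.repeat(video[..., np.newaxis], 3, axis=-1)

print(frames_as_channels.shape)  # (32, 24, 10)
print(video_rgb_like.shape)      # (10, 32, 24, 3)
```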
PS3:
An example of this 2.5D approach is:
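from keras.applications import InceptionV3
from keras.layers import Input, TimeDistributed, Conv3D, Flatten, Dense

# input_shape is (frames, width, height, 3); nb_of_filters and nb_of_classes are up to you.
video_input = Input(shape=input_shape)
base_cnn_model = InceptionV3(include_top=False, ..)
# Apply the same pretrained CNN to every frame.
temporal_analysis = TimeDistributed(base_cnn_model)(video_input)
# Keras 2 expects the kernel size as a tuple.
conv3d_analysis = Conv3D(nb_of_filters, (3, 3, 3))(temporal_analysis)
conv3d_analysis = Conv3D(nb_of_filters, (3, 3, 3))(conv3d_analysis)
output = Flatten()(conv3d_analysis)
output = Dense(nb_of_classes, activation="softmax")(output)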
Answer 2:
After doing lots of research, I finally stumbled upon the Keras example for the ConvLSTM2D layer (already mentioned by Marcin Możejko), which does exactly what I need.
In the current version of Keras (v1.2.2), this layer is already included and can be imported using
from keras.layers.convolutional_recurrent import ConvLSTM2D
To use this layer, the video data has to be formatted as follows:
[nb_samples, nb_frames, width, height, channels] # if using dim_ordering = 'tf'
[nb_samples, nb_frames, channels, width, height] # if using dim_ordering = 'th'
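As a rough sketch of that formatting step for the next-frame task, the following NumPy helper cuts a greyscale clip into input windows of N frames plus the frame that follows each window, using the 'tf' ordering above. The helper name `make_windows` and all sizes are assumptions made for illustration; they do not come from the Keras example.

```python
import numpy as np


def make_windows(video, n_prev):
    """Split a (frames, width, height) clip into inputs and targets.

    Returns
      x: (nb_samples, n_prev, width, height, 1)  -- the previous frames
      y: (nb_samples, width, height, 1)          -- the frame to predict
    """
    nb_samples = video.shape[0] - n_prev
    if nb_samples <= 0:
        raise ValueError("clip is too short for the requested window")
    # Window i holds frames i .. i + n_prev - 1; its target is frame i + n_prev.
    x = np.stack([video[i:i + n_prev] for i in range(nb_samples)])
    y = video[n_prev:]
    # Add the trailing channel axis (1 channel for greyscale).
    return x[..., np.newaxis], y[..., np.newaxis]


video = np.random.rand(20, 40, 40).astype("float32")  # hypothetical clip
x, y = make_windows(video, n_prev=5)
print(x.shape)  # (15, 5, 40, 40, 1)
print(y.shape)  # (15, 40, 40, 1)
```

For the 'th' ordering, the channel axis would instead go right after the frame axis.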
Source: https://stackoverflow.com/questions/42633644/using-keras-for-video-prediction-time-series