How do you pass video features from a CNN to an LSTM?
After you pass a video frame through a convnet and get an output feature map, how do you pass that data into an LSTM? Also, how do you pass multiple frames to the LSTM thru the CNN? In other works I want to process video frames with an CNN to get the spatial features. Then I want pass these features to an LSTM to do temporal processing on the spatial features. How do I connect the LSTM to the video features? For example if the input video is 56x56 and then when passed through all of the CNN layers, say it comes out as 20: 5x5's. How are these connected to the LSTM on a frame by frame basis?