I am developing a NN for a VQA task. I am actually using an embedding layer + LSTM to produce the question embedding, which is then combined with the CNN output to produce t