I'm having trouble understanding the documentation for PyTorch's LSTM module (and also RNN and GRU, which are similar). Regarding the outputs, it says:
It really depends on the model you use and how you interpret it. The output is almost never interpreted directly: if the input is encoded, there should be a softmax layer on top to decode the results.
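As a minimal sketch of that idea (all sizes here are made up for illustration), `nn.LSTM` returns the per-time-step outputs plus the final hidden and cell states, and a linear layer followed by a softmax turns the raw outputs into a distribution:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just for illustration.
vocab_size, embed_dim, hidden_dim = 10, 8, 16
seq_len, batch_size = 5, 3

lstm = nn.LSTM(embed_dim, hidden_dim)        # expects (seq_len, batch, embed_dim) by default
decoder = nn.Linear(hidden_dim, vocab_size)  # maps hidden states to vocabulary logits

x = torch.randn(seq_len, batch_size, embed_dim)
output, (h_n, c_n) = lstm(x)

# output: last layer's hidden state at every time step -> (seq_len, batch, hidden_dim)
# h_n:    final hidden state for each layer            -> (num_layers, batch, hidden_dim)
# c_n:    final cell state for each layer              -> (num_layers, batch, hidden_dim)
print(output.shape, h_n.shape, c_n.shape)

# "Decode" the raw output into a probability distribution over the vocabulary.
probs = torch.softmax(decoder(output), dim=-1)  # (seq_len, batch, vocab_size)
```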
Note: In language modeling, hidden states are used to define the probability of the next word: p(w_{t+1} | w_1, ..., w_t) = softmax(W·h_t + b).
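In code, that formula is just a linear projection of the hidden state followed by a softmax. A minimal sketch (sizes made up; `nn.Linear` holds both the W matrix and the bias b):

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 10, 16
W = nn.Linear(hidden_dim, vocab_size)   # implements W·h_t + b from the formula above

h_t = torch.randn(hidden_dim)           # hidden state after reading w_1, ..., w_t
p_next = torch.softmax(W(h_t), dim=-1)  # p(w_{t+1} | w_1, ..., w_t)
print(p_next.sum())                     # tensor(1.), a valid probability distribution
```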