Can I use multiple softmax outputs in the final layer of a transformer? If so, how do I calculate the loss from them? I am working in PyTorch.
And I am asking because my da
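
For concreteness, here is a minimal sketch of one common way such a setup is written, not necessarily the right fit for the data in question. It assumes each softmax corresponds to a separate, mutually exclusive label group. The class name `MultiHeadClassifier`, the helper `multi_head_loss`, the two group sizes (3 and 5), and the mean pooling are all hypothetical choices made for illustration. Each head returns raw logits, because `nn.CrossEntropyLoss` applies log-softmax internally, and the per-head losses are simply summed.

```python
import torch
import torch.nn as nn


class MultiHeadClassifier(nn.Module):
    """Shared transformer encoder with one linear head per label group (hypothetical)."""

    def __init__(self, d_model=32, num_classes_per_head=(3, 5)):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # One output head per label group; each gets its own softmax via the loss.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, n) for n in num_classes_per_head]
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); mean-pool over the sequence dimension.
        pooled = self.encoder(x).mean(dim=1)
        # Return raw logits, one tensor per head (no softmax applied here).
        return [head(pooled) for head in self.heads]


def multi_head_loss(logits_list, targets_list):
    """Sum of per-head cross-entropy losses (unweighted; weighting is an option)."""
    criterion = nn.CrossEntropyLoss()
    return sum(criterion(lg, tg) for lg, tg in zip(logits_list, targets_list))


torch.manual_seed(0)
model = MultiHeadClassifier()
x = torch.randn(4, 10, 32)  # dummy batch: 4 sequences of length 10
targets = [torch.randint(0, 3, (4,)), torch.randint(0, 5, (4,))]

logits = model(x)
loss = multi_head_loss(logits, targets)
loss.backward()  # gradients flow through both heads and the shared encoder
```

At inference time, `torch.softmax(l, dim=-1)` can be applied to each head's logits separately to get per-group probabilities. If one group matters more than another, the sum can be replaced by a weighted sum.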