Multi-Head Attention: Correct implementation of the linear transformations of Q, K, V

Asked by 小鲜肉 on 2020-12-17 19:10

I am implementing Multi-Head Self-Attention in PyTorch. I have looked at a couple of implementations, and they seem a bit wrong, or at least I am not sure that the linear transformations of Q, K, and V are done correctly.
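
For reference, here is a minimal sketch of the standard way the Q, K, V projections are usually implemented (this is not the asker's code; the class and parameter names such as `MultiHeadSelfAttention`, `embed_dim`, and `num_heads` are illustrative assumptions): each of Q, K, and V gets a single `nn.Linear(embed_dim, embed_dim)`, and the output is reshaped into `num_heads` slices of size `head_dim` rather than using a separate linear layer per head.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One linear layer per projection. This is mathematically equivalent
        # to num_heads separate (embed_dim x head_dim) weight matrices
        # stacked side by side into one (embed_dim x embed_dim) matrix.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        batch, seq_len, embed_dim = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, num_heads, seq_len, head_dim)

        # Concatenate the heads back together and apply the final projection.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
        return self.out_proj(out)

# Quick shape check:
mha = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
print(mha(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

Implementations that instead build `num_heads` small `nn.Linear(embed_dim, head_dim)` layers and loop over them are not wrong, just slower; the single fused projection above computes the same thing in one matrix multiply.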
