I am implementing Multi-Head Self-Attention in PyTorch at the moment. I looked at a couple of existing implementations and they seem a bit wrong, or at least I am not sure they are correct.
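For concreteness, here is a minimal sketch of the standard scaled dot-product formulation as I understand it from "Attention Is All You Need" (the class and variable names are my own, and I am assuming separate Q/K/V projections plus an output projection; an optional attention mask is included as a hypothetical parameter):

```python
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention sketch."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One linear projection each for queries, keys, and values.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Output projection that mixes the concatenated heads.
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        batch, seq_len, embed_dim = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:
            # Positions where mask == 0 are excluded from attention.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (batch, num_heads, seq_len, head_dim)

        # Merge heads back: (batch, seq_len, embed_dim)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
        return self.out_proj(out)


# Quick shape check
mha = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)
print(mha(x).shape)  # torch.Size([2, 10, 64])
```

This is roughly what I expected the standard version to look like, which is why the implementations I found confused me.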