Positional encodings lead to worse convergence in language modeling
This is a tough question, but I might as well try. I'm implementing the architecture from this paper, https://arxiv.org/pdf/1503.08895.pdf, for language modeling. See page 2 for a diagram, and the top of page 5 for the section on positional (or "temporal") encoding. More on positional encoding can be found in https://arxiv.org/pdf/1706.03762.pdf, at the bottom of page 5 and the top of page 6. (I was directed to that second paper by the authors of the first.)
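For context, here is a minimal NumPy sketch of the sinusoidal encoding described at that point in the second paper (Section 3.5); the function name and the `max_len`/`d_model` parameter names are mine, not from either paper:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Sinusoidal encoding from Vaswani et al. (2017), Section 3.5:
        PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Returns an array of shape (max_len, d_model).
    """
    positions = np.arange(max_len)[:, np.newaxis]    # shape (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # shape (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # shape (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])        # even dims: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])        # odd dims: cosine
    return angles

# Usage: the (max_len, d_model) matrix is added to the token embeddings,
# e.g. for a batch of embedded inputs of shape (batch, max_len, d_model):
#   x = embedded_inputs + sinusoidal_positional_encoding(max_len, d_model)
```

So here's my Keras implementation in a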