BigBird: How to implement the 'roll' and 'gather' matrix operations in PyTorch?

遥遥无期 2020-12-21 19:01

I am implementing the following paper, which adapts BERT to long sequences by using sparse attention. However, they say that the most effici
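For reference, here is a minimal sketch of the two operations named in the title, using `torch.roll` and `torch.gather` on a toy blocked tensor. The shapes, the variable names (`keys`, `rand_idx`, `window`) and the idea of building a neighbour window per block are assumptions made for illustration only; this is not the paper's implementation.

```python
import torch

# Toy blocked tensor: 4 blocks, 2 tokens per block, feature size 3.
# All sizes and names are hypothetical, chosen only for illustration.
num_blocks, block_size, dim = 4, 2, 3
keys = torch.arange(num_blocks * block_size * dim, dtype=torch.float32)
keys = keys.view(num_blocks, block_size, dim)

# roll: circular shift along the block axis.
left = torch.roll(keys, shifts=1, dims=0)    # position i holds original block i-1
right = torch.roll(keys, shifts=-1, dims=0)  # position i holds original block i+1

# Stacking the shifted copies gives every block a window of its neighbours.
window = torch.cat([left, keys, right], dim=1)  # (num_blocks, 3 * block_size, dim)

# gather: pick arbitrary blocks for each block (2 per block here).
rand_idx = torch.tensor([[2, 3], [0, 3], [1, 0], [2, 1]])  # (num_blocks, r)
r = rand_idx.shape[1]

# torch.gather needs an index with the same number of dims as the input,
# so flatten each block to one row and repeat the block id across that row.
flat = keys.view(num_blocks, block_size * dim)
idx = rand_idx.reshape(-1, 1).expand(-1, block_size * dim)
gathered = torch.gather(flat, 0, idx).view(num_blocks, r, block_size, dim)
```

In this sketch the gather step gives the same result as the advanced-indexing expression `keys[rand_idx]`; the explicit `torch.gather` form is shown only because the question asks about it.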
