I am implementing the following paper, which is a BERT for long sequences. They use sparse attention to adapt it for longer sequences. However, they say that the most effici