I am aware that the attention mechanism proves its worth specifically when dealing with long sequences, where recurrent models run into vanishing gradients and, more generally, into difficulty representing long-range dependencies.
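To make the point concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function name and shapes are illustrative, not from the original text). Because every query position attends to every key position in a single step, information flows between any two positions along a path of constant length, instead of being propagated through many recurrent steps where gradients can vanish.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (seq_len, d_k) / (seq_len, d_v).
    Returns the attended output and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Each of the 6 query positions attends directly to all 6 key positions,
# so the interaction path between any two positions has length 1,
# regardless of how long the sequence is.
rng = np.random.default_rng(0)
n, d = 6, 4
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over all positions, which is what lets the model pick up dependencies between distant tokens directly rather than through a chain of hidden states.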