I may be mistaken, but it seems that PyTorch Transformers are autoregressive, which is what masking is for. However, I've seen some implementations where people use just th