Why is the embedding vector multiplied by a constant in the Transformer model?

佛祖请我去吃肉 2021-01-02 21:29

I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.
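For reference, here is a minimal sketch of the step the question refers to, modeled on the tutorial's encoder (the sizes are illustrative, not taken from the tutorial):

```python
import tensorflow as tf

d_model = 512       # illustrative model dimension
vocab_size = 8000   # illustrative vocabulary size

embedding = tf.keras.layers.Embedding(vocab_size, d_model)

def embed(token_ids):
    x = embedding(token_ids)  # (batch, seq_len, d_model)
    # The constant in question: scale by sqrt(d_model) before adding
    # the positional encoding.
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))
    return x
```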

2 answers
  • 2021-01-02 21:54

    Looking around, I found this argument:

    The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
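    To put rough numbers on this (my own sketch, not part of the quoted argument): the sinusoidal positional encoding has entries in [-1, 1], so for d_model = 512 its vectors have norm sqrt(d_model / 2) = 16, while embeddings initialized with a small standard deviation (here 1/sqrt(d_model), an assumed but common choice) have norm around 1. Multiplying by sqrt(d_model) raises the embedding norm to about 22.6, so the token information is no longer drowned out by the positional signal.

    ```python
    import numpy as np

    d_model, seq_len = 512, 50   # illustrative sizes, not from the question

    # Toy embeddings with std 1/sqrt(d_model) (an assumed, common initialization).
    emb = np.random.normal(0.0, 1.0 / np.sqrt(d_model), size=(seq_len, d_model))

    # Standard sinusoidal positional encoding from the paper.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

    print(np.linalg.norm(emb, axis=-1).mean())                     # ~1: unscaled embedding
    print(np.linalg.norm(emb * np.sqrt(d_model), axis=-1).mean())  # ~22.6: scaled embedding
    print(np.linalg.norm(pe, axis=-1).mean())                      # 16: positional encoding
    ```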

  • 2021-01-02 21:58

    I believe the reason for this scaling has nothing to do with the scale applied at the attention layers. It is likely because the transformer shares the weights of the embedding layer and the output softmax. The scale you would use for the embeddings is different from the scale you would use for a fully connected layer.

    Some implementations of the transformer use this scaling even though they don't actually share the embedding weights at the output layer, but that is probably kept there for consistency (or by mistake). Just make sure that the initialization of your embeddings is consistent.
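    A minimal sketch of that weight-sharing argument (my own illustration, with assumed sizes and initialization, not code from the answer): the same matrix serves both as the embedding lookup and, transposed, as the pre-softmax output projection. It is initialized at a scale that suits the projection, and the sqrt(d_model) factor compensates for that when the matrix is used as an embedding.

    ```python
    import tensorflow as tf

    d_model, vocab_size = 512, 8000   # illustrative sizes

    # One shared weight matrix, initialized at a scale suited to the output
    # projection (std 1/sqrt(d_model) is an assumed choice; the key point is
    # that it is small).
    shared = tf.Variable(
        tf.random.normal([vocab_size, d_model], stddev=d_model ** -0.5),
        name="shared_embedding")

    def embed(token_ids):
        # Used as an embedding: gather rows, then rescale by sqrt(d_model) so
        # the vectors are not tiny relative to the positional encoding.
        x = tf.nn.embedding_lookup(shared, token_ids)
        return x * tf.math.sqrt(tf.cast(d_model, tf.float32))

    def output_logits(decoder_states):
        # Used as the pre-softmax projection: multiply by the transpose,
        # with no extra scaling.
        return tf.matmul(decoder_states, shared, transpose_b=True)
    ```

    If the weights are not actually tied, the scaling is harmless as long as the embedding initialization takes it into account, which is the consistency point above.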
