Why does the BERT transformer use the [CLS] token for classification instead of an average over all tokens?
**Question:** I am doing experiments on the BERT architecture and found that most fine-tuning tasks take the final hidden layer as the text representation, which is then passed to other models for the downstream task. BERT's last layer looks like this, where we take the [CLS] token of each sentence:

[Figure: BERT's final hidden states, with the [CLS] token of each sentence taken as the sentence representation. Image source]

I went through many discussions of this question: a huggingface issue, a Data Science forum question, and a GitHub issue. Most data scientists give this explanation:

> BERT is bidirectional, the [CLS]
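To make the two options being compared concrete, here is a minimal sketch of both pooling strategies, assuming the Hugging Face `transformers` and `torch` packages; the model name `bert-base-uncased` and the example sentence are placeholders:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT pooling example", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs.last_hidden_state        # shape: (batch, seq_len, hidden)

# Strategy 1: take the final hidden state of the [CLS] token (position 0)
# as the sentence vector.
cls_vector = last_hidden[:, 0, :]              # shape: (batch, hidden)

# Strategy 2: mean-pool over all tokens, using the attention mask so that
# padding positions do not contribute to the average.
mask = inputs["attention_mask"].unsqueeze(-1)  # shape: (batch, seq_len, 1)
mean_vector = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_vector.shape, mean_vector.shape)     # both: torch.Size([1, 768])
```

Either vector can then be fed to a downstream classifier; the question is why fine-tuning setups conventionally use `cls_vector` rather than `mean_vector`.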