bert-language-model

AttributeError: 'str' object has no attribute 'dim' in PyTorch

Submitted by 孤街浪徒 on 2020-12-12 02:06:16
Question: I got the following error in PyTorch when sending predictions through the model. Does anyone know what is going on? Below is the model architecture I created; the error output shows the issue is in the x = self.fc1(cls_hs) line.

    class BERT_Arch(nn.Module):
        def __init__(self, bert):
            super(BERT_Arch, self).__init__()
            self.bert = bert
            # dropout layer
            self.dropout = nn.Dropout(0.1)
            # relu activation function
            self.relu = nn.ReLU()
            # dense layer 1
            self.fc1 = nn.Linear…
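The question preview cuts off at the first nn.Linear, so the layer sizes below are placeholders of my own, not taken from the question. As a hedged note on the error itself: with transformers v4+, self.bert(...) returns a ModelOutput object by default, and tuple-style unpacking of that object yields its string keys, so cls_hs ends up being the string 'pooler_output' and nn.Linear raises 'str' object has no attribute 'dim'. A minimal sketch of a forward pass that avoids this:

    import torch.nn as nn
    from transformers import AutoModel

    class BERT_Arch(nn.Module):
        def __init__(self, bert):
            super(BERT_Arch, self).__init__()
            self.bert = bert
            self.dropout = nn.Dropout(0.1)
            self.relu = nn.ReLU()
            self.fc1 = nn.Linear(768, 512)   # 768 = bert-base hidden size (assumed)
            self.fc2 = nn.Linear(512, 2)     # 2 output classes (assumed)
            self.softmax = nn.LogSoftmax(dim=1)

        def forward(self, sent_id, mask):
            # return_dict=False makes BERT return a plain tuple
            # (sequence_output, pooled_output), so cls_hs is a tensor
            # rather than the string key 'pooler_output'.
            _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
            x = self.fc1(cls_hs)   # the line the traceback points at
            x = self.relu(x)
            x = self.dropout(x)
            x = self.fc2(x)
            return self.softmax(x)

    bert = AutoModel.from_pretrained("bert-base-uncased")
    model = BERT_Arch(bert)

Equivalently, one can keep the default dict-style output and take outputs.pooler_output explicitly.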

Get probability of multi-token word in MASK position

Submitted by 橙三吉。 on 2020-12-05 11:57:31
Question: It is relatively easy to get a token's probability according to a language model, as the snippet below shows. You can take the output of the model, restrict yourself to the output at the masked token's position, and then find the probability of your requested token in that output vector. However, this only works with single-token words, i.e. words that are themselves in the tokenizer's vocabulary. When a word does not exist in the vocabulary, the tokenizer will chunk it up into pieces that it does know (see…
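The snippet the question refers to is not included in this preview. Here is a hedged reconstruction of the single-token case it describes (the model name, sentence, and candidate word are my own placeholders, chosen for illustration):

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    # position of the [MASK] token in the input sequence
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    with torch.no_grad():
        logits = model(**inputs).logits                 # (1, seq_len, vocab_size)

    probs = torch.softmax(logits[0, mask_pos], dim=-1)  # distribution over the vocabulary

    # This lookup only works if the candidate word is a single vocabulary token:
    token_id = tokenizer.convert_tokens_to_ids("paris")
    print(probs[0, token_id].item())

For a word that the tokenizer splits into several word pieces, there is no single row to look up, which is exactly the multi-token problem the question asks about.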

Why does the BERT transformer use the [CLS] token for classification instead of an average over all tokens?

Submitted by 强颜欢笑 on 2020-12-01 12:00:50
Question: I am running experiments on the BERT architecture and found that most fine-tuning tasks take the final hidden layer as the text representation and later pass it to other models for the downstream task. BERT's last layer looks like this, where we take the [CLS] token of each sentence: Image source. I went through many discussions in this huggingface issue, a Data Science forum question, and a GitHub issue. Most data scientists give this explanation: BERT is bidirectional, the [CLS]…
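As a concrete illustration of the two options being compared, taking the [CLS] vector versus averaging over all tokens of the final hidden layer, here is a minimal sketch using the HuggingFace transformers API (the model name and input sentence are placeholders, not from the question):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("BERT sentence representations.", return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)

    # Option 1: the [CLS] vector at position 0, which most fine-tuning
    # setups feed to the classification head.
    cls_vector = last_hidden[:, 0, :]                     # (1, 768)

    # Option 2: mean over all tokens, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).float() # (1, seq_len, 1)
    mean_vector = (last_hidden * mask).sum(1) / mask.sum(1)   # (1, 768)

Both produce a fixed-size sentence representation; the question is about why the [CLS] slot is the conventional choice.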
