bert-language-model

Google BERT and antonym detection

Submitted by 旧巷老猫 on 2021-02-11 15:10:55
Question: I recently learned about the following phenomenon: the word embeddings of well-known state-of-the-art models such as Google BERT seem to ignore the degree of semantic contrast between antonyms when measured by the natural distance (L2 norm or cosine distance) between the corresponding embeddings. For example: the measure is the "cosine distance" (as opposed to the "cosine similarity"), which means closer vectors are supposed to have a smaller distance between them. As one can see, BERT states "weak" and "powerful
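
For illustration, a minimal sketch of how such a comparison could be reproduced, assuming the Hugging Face transformers and scipy packages; the word pair ("weak", "powerful") follows the question, and averaging the sub-token vectors is just one reasonable choice for obtaining a single word vector:

# Illustrative sketch: measure the cosine distance between contextual BERT
# embeddings of two antonyms (assumes the Hugging Face `transformers` package).
import torch
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def word_embedding(word):
    # Encode the single word and average its sub-token vectors,
    # dropping the [CLS] and [SEP] positions.
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs)[0][0]  # (seq_len, 768)
    return hidden[1:-1].mean(dim=0)

weak = word_embedding("weak")
powerful = word_embedding("powerful")
print("cosine distance:", cosine(weak.numpy(), powerful.numpy()))

A small distance here would reproduce the phenomenon described above: the antonyms end up close together despite their opposite meanings.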

How to use BERT for long sentences? [duplicate]

Submitted by 别来无恙 on 2021-02-10 15:50:21
Question: This question already has answers here: How to use Bert for long text classification? (6 answers) Closed 5 months ago. I am trying to classify given text into news, clickbait, or others. The texts I have for training are long; the distribution of lengths is shown here. Now, the question is: should I trim the text in the middle and make it 512 tokens long? But I even have documents with circa 10,000 words, so won't I lose the gist by truncation? Or should I split my text into sub-texts of
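
One commonly suggested workaround, sketched below under the assumption of the Hugging Face tokenizer API, is to split each long document into overlapping 512-token windows, classify every window, and aggregate the per-window predictions; the stride of 50 tokens is an arbitrary illustrative value:

# Illustrative sketch: cut a long document into overlapping chunks that each
# fit BERT's 512-token limit, leaving room for [CLS] and [SEP].
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def chunk_token_ids(text, max_len=512, stride=50):
    ids = tokenizer.encode(text, add_special_tokens=False)
    window = max_len - 2  # reserve two positions for [CLS] and [SEP]
    chunks = []
    for start in range(0, len(ids), window - stride):
        piece = ids[start:start + window]
        chunks.append(tokenizer.build_inputs_with_special_tokens(piece))
        if start + window >= len(ids):
            break
    return chunks

# Each chunk is then classified separately, and the chunk-level logits can be
# averaged or max-pooled into a single document-level prediction.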

Create SavedModel for BERT

Submitted by 怎甘沉沦 on 2021-01-29 18:20:39
Question: I'm using this Colab for the BERT model. In the last cells, in order to make predictions, we have: def getPrediction(in_sentences): labels = ["Negative", "Positive"] input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer) predict_input_fn = run_classifier.input_fn_builder(features=input_features,
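
For reference, a hedged sketch of one way such an Estimator-based classifier is often exported as a SavedModel; it assumes the TF 1.x Estimator API used in that Colab, that estimator and MAX_SEQ_LENGTH are already defined there, and that the feature names follow the bert run_classifier convention:

# Illustrative sketch: export the TF 1.x Estimator as a SavedModel by defining
# a serving input receiver with the features run_classifier expects.
import tensorflow as tf

def serving_input_receiver_fn():
    features = {
        "input_ids": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="input_ids"),
        "input_mask": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="input_mask"),
        "segment_ids": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="segment_ids"),
        "label_ids": tf.placeholder(tf.int32, [None], name="label_ids"),
    }
    return tf.estimator.export.ServingInputReceiver(features, features)

export_path = estimator.export_saved_model("./saved_model", serving_input_receiver_fn)
print("SavedModel written to", export_path)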

CUDA out of memory

Submitted by 心不动则不痛 on 2021-01-29 12:58:23
Question: I am getting an error when trying to run a BERT model for an NER task: "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.82 GiB total capacity; 2.58 GiB already allocated; 25.38 MiB free; 6.33 MiB cached)". I have also tried reducing the batch size to 1. epochs = 10 max_grad_norm = 1.0 for _ in trange(epochs, desc="Epoch"): # TRAIN loop model.train() tr_loss = 0 nb_tr_examples, nb_tr_steps = 0, 0 for step, batch in enumerate(train_dataloader): # add batch to gpu batch = tuple(t.to
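
A common mitigation when the batch size is already at its minimum is gradient accumulation, sketched below under the assumption of the Hugging Face transformers API, with model, optimizer, train_dataloader, device and max_grad_norm defined as in the loop above and each batch holding input ids, attention mask, and labels:

# Illustrative sketch: accumulate gradients over several small batches so the
# effective batch size stays large while peak GPU memory stays low.
import torch

accumulation_steps = 4  # illustrative value; effective batch = batch_size * 4
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    # In transformers, the first element of the output is the loss when labels are passed.
    loss = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)[0]
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

Truncating sequences to a shorter maximum length at tokenization time (for example 128 tokens instead of 512) also helps, since attention memory grows quadratically with sequence length; that is often enough on a 4 GB GPU.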

Confusion in understanding the output of the BertForTokenClassification class from the Transformers library

Submitted by 旧巷老猫 on 2021-01-28 19:04:01
Question: This is the example given in the documentation of the Transformers PyTorch library: from transformers import BertTokenizer, BertForTokenClassification import torch tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForTokenClassification.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True) input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1 labels = torch.tensor([1] * input
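
For orientation, a short sketch of how the returned value is typically unpacked when output_hidden_states and output_attentions are enabled (continuing the snippet above and assuming labels has shape (1, sequence_length); the tuple counts are for bert-base):

# Continuing the snippet: with labels plus both output flags, the forward pass
# yields the loss, the logits, all hidden states, and all attention maps.
outputs = model(input_ids, labels=labels)
loss = outputs[0]           # scalar token-classification loss
logits = outputs[1]         # (batch_size, sequence_length, num_labels) per-token scores
hidden_states = outputs[2]  # tuple of 13 tensors: embedding output + one per layer
attentions = outputs[3]     # tuple of 12 tensors: one attention map per layer

print(logits.shape, len(hidden_states), len(attentions))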

How to stop BERT from breaking apart specific words into word-pieces

Submitted by ぃ、小莉子 on 2021-01-28 06:06:29
Question: I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many specific words, and I don't want the BERT model to break them into word-pieces. Is there any solution for this? For example: tokenizer = BertTokenizer('bert-base-uncased-vocab.txt') tokens = tokenizer.tokenize("metastasis") creates tokens like this: ['meta', '##sta', '##sis'] However, I want to keep the whole word as one token, like this: ['metastasis'] Answer 1: You are free to add new tokens to the
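
Continuing the answer, a minimal sketch of the add_tokens approach; the example word follows the question, resizing the embedding matrix is only needed when a model is attached, and the new token's vector starts out randomly initialized:

# Illustrative sketch: register a whole word as a new token so the tokenizer
# stops splitting it into word-pieces.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added = tokenizer.add_tokens(['metastasis'])
model.resize_token_embeddings(len(tokenizer))  # make room for the new token

print(tokenizer.tokenize("metastasis"))  # ['metastasis'] instead of word-pieces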

Where can I get the pretrained word embeddings for BERT?

Submitted by 蓝咒 on 2021-01-20 11:57:06
Question: I know that BERT has a total vocabulary size of 30522, which contains some words and subwords. I want to get the initial input embeddings of BERT. So my requirement is to get the table of size [30522, 768], which I can index by token id to get the corresponding embeddings. Where can I get this table? Answer 1: The BertModel classes have get_input_embeddings(): import torch from transformers import BertModel, BertTokenizer tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') bert = BertModel.from_pretrained(
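
Continuing the answer, a small sketch that pulls out the [30522, 768] embedding table and indexes it by token id (same bert-base-uncased checkpoint; the example word is arbitrary):

# Illustrative sketch: extract BERT's static input-embedding table and look up
# the vector for a single token id.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

embedding_table = bert.get_input_embeddings().weight  # shape: [30522, 768]
print(embedding_table.shape)

token_id = tokenizer.convert_tokens_to_ids('dog')
dog_vector = embedding_table[token_id]  # non-contextual input embedding
print(dog_vector.shape)  # torch.Size([768])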

Fine-tune BERT for a specific domain (unsupervised)

Submitted by 孤人 on 2021-01-20 08:39:56
Question: I want to fine-tune BERT on texts that are related to a specific domain (in my case, related to engineering). The training should be unsupervised, since I don't have any labels or anything. Is this possible? Answer 1: What you in fact want is to continue pre-training BERT on text from your specific domain. What you do in this case is continue training the model as a masked language model, but on your domain-specific data. You can use the run_mlm.py script from Huggingface's Transformers.
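
For reference, a hedged sketch of roughly what run_mlm.py does, written here with the Trainer API; the corpus file name and all hyperparameters are placeholders:

# Illustrative sketch: continue pre-training BERT as a masked language model
# on domain-specific text. File name and hyperparameters are placeholders.
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path='engineering_corpus.txt',  # placeholder path
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir='bert-engineering',
                         num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  data_collator=collator, train_dataset=dataset)
trainer.train()
trainer.save_model('bert-engineering')

The resulting checkpoint can then be loaded like any other BERT model and fine-tuned on a downstream task.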