bert-language-model

Google BERT and antonym detection

Submitted by 旧巷老猫 on 2021-02-11 15:10:55
Question: I recently learned about the following phenomenon: the word embeddings of well-known state-of-the-art models such as Google BERT seem to ignore the degree of semantic contrast between antonyms when measured by the natural distance (L2 norm or cosine distance) between the corresponding embeddings. For example: the measure is the "cosine distance" (as opposed to the "cosine similarity"), which means closer vectors are supposed to have a smaller distance between them. As one can see, BERT states "weak" and "powerful
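
For illustration, a minimal sketch of how such a comparison could be reproduced, assuming the Hugging Face transformers and scipy packages; the word pair ("weak", "powerful") follows the question, and averaging the sub-token vectors is just one reasonable choice for obtaining a single word vector:

# Illustrative sketch: measure the cosine distance between contextual BERT
# embeddings of two antonyms (assumes the Hugging Face `transformers` package).
import torch
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def word_embedding(word):
    # Encode the single word and average its sub-token vectors,
    # dropping the [CLS] and [SEP] positions.
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs)[0][0]  # (seq_len, 768)
    return hidden[1:-1].mean(dim=0)

weak = word_embedding("weak")
powerful = word_embedding("powerful")
print("cosine distance:", cosine(weak.numpy(), powerful.numpy()))

A small distance here would reproduce the phenomenon described above: the antonyms end up close together despite their opposite meanings.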

How to use BERT for long sentences? [duplicate]

Submitted by 别来无恙 on 2021-02-10 15:50:21
Question: This question already has answers here: How to use Bert for long text classification? (6 answers) Closed 5 months ago. I am trying to classify given text into news, clickbait, or others. The texts I have for training are long; the distribution of lengths is shown here. Now, the question is: should I trim the text in the middle and make it 512 tokens long? But I even have documents with circa 10,000 words, so won't I lose the gist by truncation? Or should I split my text into sub-texts of
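
One commonly suggested workaround, sketched below under the assumption of the Hugging Face tokenizer API, is to split each long document into overlapping 512-token windows, classify every window, and aggregate the per-window predictions; the stride of 50 tokens is an arbitrary illustrative value:

# Illustrative sketch: cut a long document into overlapping chunks that each
# fit BERT's 512-token limit, leaving room for [CLS] and [SEP].
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def chunk_token_ids(text, max_len=512, stride=50):
    ids = tokenizer.encode(text, add_special_tokens=False)
    window = max_len - 2  # reserve two positions for [CLS] and [SEP]
    chunks = []
    for start in range(0, len(ids), window - stride):
        piece = ids[start:start + window]
        chunks.append(tokenizer.build_inputs_with_special_tokens(piece))
        if start + window >= len(ids):
            break
    return chunks

# Each chunk is then classified separately, and the chunk-level logits can be
# averaged or max-pooled into a single document-level prediction.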

Create SavedModel for BERT

Submitted by 怎甘沉沦 on 2021-01-29 18:20:39
Question: I'm using this Colab for the BERT model. In the last cells, in order to make predictions, we have: def getPrediction(in_sentences): labels = ["Negative", "Positive"] input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer) predict_input_fn = run_classifier.input_fn_builder(features=input_features,
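
For reference, a hedged sketch of one way such an Estimator-based classifier is often exported as a SavedModel; it assumes the TF 1.x Estimator API used in that Colab, that estimator and MAX_SEQ_LENGTH are already defined there, and that the feature names follow the bert run_classifier convention:

# Illustrative sketch: export the TF 1.x Estimator as a SavedModel by defining
# a serving input receiver with the features run_classifier expects.
import tensorflow as tf

def serving_input_receiver_fn():
    features = {
        "input_ids": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="input_ids"),
        "input_mask": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="input_mask"),
        "segment_ids": tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name="segment_ids"),
        "label_ids": tf.placeholder(tf.int32, [None], name="label_ids"),
    }
    return tf.estimator.export.ServingInputReceiver(features, features)

export_path = estimator.export_saved_model("./saved_model", serving_input_receiver_fn)
print("SavedModel written to", export_path)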

CUDA out of memory

Submitted by 心不动则不痛 on 2021-01-29 12:58:23
Question: I am getting an error when trying to run a BERT model for an NER task: "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.82 GiB total capacity; 2.58 GiB already allocated; 25.38 MiB free; 6.33 MiB cached)". I have also tried reducing the batch size to 1. epochs = 10 max_grad_norm = 1.0 for _ in trange(epochs, desc="Epoch"): # TRAIN loop model.train() tr_loss = 0 nb_tr_examples, nb_tr_steps = 0, 0 for step, batch in enumerate(train_dataloader): # add batch to gpu batch = tuple(t.to
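
A common mitigation when the batch size is already at its minimum is gradient accumulation, sketched below under the assumption of the Hugging Face transformers API, with model, optimizer, train_dataloader, device and max_grad_norm defined as in the loop above and each batch holding input ids, attention mask, and labels:

# Illustrative sketch: accumulate gradients over several small batches so the
# effective batch size stays large while peak GPU memory stays low.
import torch

accumulation_steps = 4  # illustrative value; effective batch = batch_size * 4
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    # In transformers, the first element of the output is the loss when labels are passed.
    loss = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)[0]
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

Truncating sequences to a shorter maximum length at tokenization time (for example 128 tokens instead of 512) also helps, since attention memory grows quadratically with sequence length; that is often enough on a 4 GB GPU.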

Confusion in understanding the output of the BertForTokenClassification class from the Transformers library

Submitted by 旧巷老猫 on 2021-01-28 19:04:01
Question: This is the example given in the documentation of the Transformers PyTorch library: from transformers import BertTokenizer, BertForTokenClassification import torch tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForTokenClassification.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True) input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1 labels = torch.tensor([1] * input
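
For orientation, a short sketch of how the returned value is typically unpacked when output_hidden_states and output_attentions are enabled (continuing the snippet above and assuming labels has shape (1, sequence_length); the tuple counts are for bert-base):

# Continuing the snippet: with labels plus both output flags, the forward pass
# yields the loss, the logits, all hidden states, and all attention maps.
outputs = model(input_ids, labels=labels)
loss = outputs[0]           # scalar token-classification loss
logits = outputs[1]         # (batch_size, sequence_length, num_labels) per-token scores
hidden_states = outputs[2]  # tuple of 13 tensors: embedding output + one per layer
attentions = outputs[3]     # tuple of 12 tensors: one attention map per layer

print(logits.shape, len(hidden_states), len(attentions))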

How to stop BERT from breaking apart specific words into word-pieces

Submitted by ぃ、小莉子 on 2021-01-28 06:06:29
Question: I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many specific words, and I don't want the BERT model to break them into word-pieces. Is there any solution for this? For example: tokenizer = BertTokenizer('bert-base-uncased-vocab.txt') tokens = tokenizer.tokenize("metastasis") creates tokens like this: ['meta', '##sta', '##sis'] However, I want to keep the whole word as one token, like this: ['metastasis'] Answer 1: You are free to add new tokens to the
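
Continuing the answer, a minimal sketch of the add_tokens approach; the example word follows the question, resizing the embedding matrix is only needed when a model is attached, and the new token's vector starts out randomly initialized:

# Illustrative sketch: register a whole word as a new token so the tokenizer
# stops splitting it into word-pieces.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

num_added = tokenizer.add_tokens(['metastasis'])
model.resize_token_embeddings(len(tokenizer))  # make room for the new token

print(tokenizer.tokenize("metastasis"))  # ['metastasis'] instead of word-pieces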

Where can I get the pretrained word embeddings for BERT?

Submitted by 蓝咒 on 2021-01-20 11:57:06
Question: I know that BERT has a total vocabulary size of 30522, which contains some words and subwords. I want to get the initial input embeddings of BERT. So my requirement is to get the table of size [30522, 768], which I can index by token id to get the corresponding embeddings. Where can I get this table? Answer 1: The BertModel classes have get_input_embeddings(): import torch from transformers import BertModel, BertTokenizer tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') bert = BertModel.from_pretrained(
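
Continuing the answer, a small sketch that pulls out the [30522, 768] embedding table and indexes it by token id (same bert-base-uncased checkpoint; the example word is arbitrary):

# Illustrative sketch: extract BERT's static input-embedding table and look up
# the vector for a single token id.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

embedding_table = bert.get_input_embeddings().weight  # shape: [30522, 768]
print(embedding_table.shape)

token_id = tokenizer.convert_tokens_to_ids('dog')
dog_vector = embedding_table[token_id]  # non-contextual input embedding
print(dog_vector.shape)  # torch.Size([768])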

Fine-tune BERT for a specific domain (unsupervised)

Submitted by 孤人 on 2021-01-20 08:39:56
Question: I want to fine-tune BERT on texts that are related to a specific domain (in my case, related to engineering). The training should be unsupervised, since I don't have any labels or anything. Is this possible? Answer 1: What you in fact want is to continue pre-training BERT on text from your specific domain. What you do in this case is continue training the model as a masked language model, but on your domain-specific data. You can use the run_mlm.py script from Huggingface's Transformers.
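
For reference, a hedged sketch of roughly what run_mlm.py does, written here with the Trainer API; the corpus file name and all hyperparameters are placeholders:

# Illustrative sketch: continue pre-training BERT as a masked language model
# on domain-specific text. File name and hyperparameters are placeholders.
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path='engineering_corpus.txt',  # placeholder path
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir='bert-engineering',
                         num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  data_collator=collator, train_dataset=dataset)
trainer.train()
trainer.save_model('bert-engineering')

The resulting checkpoint can then be loaded like any other BERT model and fine-tuned on a downstream task.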