huggingface-transformers

I want to use “grouped_entities” in the huggingface pipeline for the NER task, how do I do that?

折月煮酒 submitted on 2021-01-29 19:00:56
Question: I want to use "grouped_entities" in the huggingface pipeline for the NER task, but I am having issues doing that. I did look at the following link on GitHub, but it did not help: https://github.com/huggingface/transformers/pull/4987 Answer 1: I found the answer; it is very straightforward in transformers v4.0.0. Previously I was using an older version of the transformers package. Example: from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline from transformers import …
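A minimal sketch of what the v4.0.0 usage might look like; the checkpoint name here is only illustrative, substitute whatever NER model you are using:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"  # illustrative NER checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# grouped_entities=True merges sub-word pieces that belong to the same entity
# into a single span in the output.
ner = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)
print(ner("Hugging Face Inc. is based in New York City."))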

Question asking pipeline for Huggingface transformers

家住魔仙堡 submitted on 2021-01-29 13:20:42
Question: Huggingface transformers has a pipeline for question answering, fine-tuned on the SQuAD dataset. What would I need to do to develop a question-asking pipeline? This would use the context, question and answer to generate questions with answers from a context. Are there any examples of creating new huggingface pipelines? Answer 1: Pipelines can simply be treated as a wrapper around pre-trained models. In this case, you could perform fine-tuning/pre-training in the same way as existing …
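One way to sketch that advice, assuming you have already fine-tuned a seq2seq model (T5, BART, etc.) to map answer-plus-context to a question; the checkpoint name and input format below are hypothetical placeholders, not an existing pipeline:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint: replace with your own question-generation fine-tune.
model_name = "your-username/t5-question-generation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def ask_question(answer, context):
    # The "answer: ... context: ..." prompt format is an assumption; it must
    # match whatever format the model was fine-tuned on.
    text = f"answer: {answer} context: {context}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(ask_question("Kaggle",
                   "In June 2017 Kaggle announced that it passed 1 million registered users."))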

Confusion in understanding the output of the BertForTokenClassification class from the Transformers library

旧巷老猫 submitted on 2021-01-28 19:04:01
Question: This is the example given in the documentation of the transformers PyTorch library: from transformers import BertTokenizer, BertForTokenClassification import torch tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForTokenClassification.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True) input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1 labels = torch.tensor([1] * input …
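A runnable completion of that snippet, with comments on what each output contains; shapes are shown for the recent return-object style (older versions return the same fields as a plain tuple):

import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained(
    'bert-base-uncased', output_hidden_states=True, output_attentions=True)

input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # batch size 1
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # one label per token, batch size 1

outputs = model(input_ids, labels=labels)
# outputs.loss          -> scalar token-classification training loss
# outputs.logits        -> (batch, seq_len, num_labels) per-token classification scores
# outputs.hidden_states -> tuple of (batch, seq_len, 768) tensors, embeddings + one per layer
# outputs.attentions    -> tuple of (batch, num_heads, seq_len, seq_len) tensors
print(outputs.logits.shape)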

Huggingface saving tokenizer

旧巷老猫 submitted on 2021-01-28 03:31:18
Question: I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet. BASE_MODEL = "distilbert-base-multilingual-cased" tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL) tokenizer.save_vocabulary("./models/tokenizer/") tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/") However, the last line gives the error: OSError: Can't load config for './models/tokenizer3/'. Make sure that: - './models/tokenizer3/' …
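A likely fix, assuming the goal is simply to reload the tokenizer offline: save with save_pretrained(), which writes the tokenizer configuration files alongside the vocabulary, instead of save_vocabulary(), which writes only the vocab file:

from transformers import AutoTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# save_pretrained() writes vocab.txt plus tokenizer_config.json and
# special_tokens_map.json, which from_pretrained() needs in order to load locally.
tokenizer.save_pretrained("./models/tokenizer/")

tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")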

Where can I get the pretrained word embeddings for BERT?

蓝咒 submitted on 2021-01-20 11:57:06
Question: I know that BERT has a total vocabulary size of 30522, which contains words and subwords. I want to get the initial input embeddings of BERT. So, my requirement is to get a table of size [30522, 768] which I can index by token id to get its embedding. Where can I get this table? Answer 1: The BERT models have get_input_embeddings(): import torch from transformers import BertModel, BertTokenizer tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') bert = BertModel.from_pretrained( …
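A minimal sketch of pulling out that [30522, 768] table via get_input_embeddings(); indexing it by token id gives the (position-independent) input embedding for that token:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

embedding_layer = bert.get_input_embeddings()   # nn.Embedding with a [30522, 768] weight
table = embedding_layer.weight                  # the full lookup table
print(table.shape)                              # torch.Size([30522, 768])

token_id = tokenizer.convert_tokens_to_ids('dog')
dog_embedding = table[token_id]                 # 768-dim vector for the token 'dog'
print(dog_embedding.shape)                      # torch.Size([768])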

Sliding window for long text in BERT for Question Answering

岁酱吖の submitted on 2021-01-05 00:51:51
Question: I've read a post which explains how the sliding window works, but I cannot find any information on how it is actually implemented. From what I understand, if the input is too long, a sliding window can be used to process the text. Please correct me if I am wrong. Say I have the text "In June 2017 Kaggle announced that it passed 1 million registered users". Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding). In June 2017 Kaggle …
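One common way this is implemented with the library (an illustration of the mechanism, not necessarily what the referenced post does) is the tokenizer's overflow support: return_overflowing_tokens together with stride splits a long context into overlapping chunks, each paired with the question. This requires a fast tokenizer, which AutoTokenizer returns by default:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "How many registered users did Kaggle pass?"
context = "In June 2017 Kaggle announced that it passed 1 million registered users"

# max_length and stride are kept small here just to make the overlap visible.
encoded = tokenizer(
    question,
    context,
    truncation="only_second",        # only the context gets chunked, never the question
    max_length=24,
    stride=8,                        # number of tokens shared between consecutive chunks
    return_overflowing_tokens=True,
)

for ids in encoded["input_ids"]:
    print(tokenizer.decode(ids))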

How to load the saved tokenizer from pretrained model in Pytorch

混江龙づ霸主 submitted on 2020-12-30 08:37:28
Question: I fine-tuned a pretrained BERT model in PyTorch using huggingface transformers. All the training/validation was done on a GPU in the cloud. At the end of the training, I save the model and tokenizer as below: best_model.save_pretrained('./saved_model/') tokenizer.save_pretrained('./saved_model/') This creates the following files in the saved_model directory: config.json added_token.json special_tokens_map.json tokenizer_config.json vocab.txt pytorch_model.bin Now, I download the saved_model directory in …
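A sketch of reloading both pieces locally, assuming the directory was downloaded intact; the tokenizer loads from the same directory as the model, and no internet access is needed as long as the files listed above are present. The head class below is an assumption, swap in whichever one was actually fine-tuned:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

save_dir = "./saved_model/"

# Both calls read only from the local directory.
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForSequenceClassification.from_pretrained(save_dir)  # use your fine-tuned head class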

Saving and reloading a huggingface fine-tuned transformer

我们两清 submitted on 2020-12-26 11:11:18
Question: I am trying to reload a fine-tuned DistilBertForTokenClassification model. I am using transformers 3.4.0 and pytorch version 1.6.0+cu101. After using the Trainer to train the downloaded model, I save the model with trainer.save_model(), and in my troubleshooting I also save it to a different directory via model.save_pretrained(). I am using Google Colab and saving the model to my Google Drive. After testing the model I also evaluated it on my test set, getting great results; however, when I return …
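A sketch of the save/reload round trip under those assumptions (the Google Drive path is only illustrative); saving the tokenizer alongside the weights and reloading with the matching class in eval mode is usually what makes the reloaded model reproduce the earlier evaluation results:

from transformers import AutoTokenizer, DistilBertForTokenClassification

save_dir = "/content/drive/MyDrive/distilbert-ner"   # illustrative Drive path

# After training:
# trainer.save_model(save_dir)          # writes pytorch_model.bin + config.json
# tokenizer.save_pretrained(save_dir)   # the tokenizer must be saved too

# Later, e.g. in a fresh Colab session:
model = DistilBertForTokenClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model.eval()   # evaluation mode so dropout does not perturb the predictions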