Pretraining a language model on a small custom corpus

Question

I am curious whether transfer learning can be used for text generation, that is, taking a pre-trained model and re-training it on a specific kind of text.

For example, given a pre-trained BERT model and a small corpus of medical (or any other domain-specific) text, build a language model that can generate medical text. The assumption is that you do not have a large amount of medical text, which is why transfer learning is needed.

Putting it as a pipeline, I would describe this as:

Using a pre-trained BERT tokenizer.
Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
Generating text that resembles the text within the small custom corpus.
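
As a rough illustration of step 2, the sketch below picks candidate new tokens: words that occur often in the custom corpus but are missing from the existing vocabulary. The function name `candidate_new_tokens`, the toy vocabulary, and the toy corpus are all hypothetical; a real BERT vocabulary is WordPiece-based and much larger, so this is only a simplified view of the idea.

```python
import re
from collections import Counter


def candidate_new_tokens(corpus, vocab, min_count=2, max_new=100):
    """Return frequent corpus words that are missing from `vocab`."""
    counts = Counter()
    for line in corpus:
        # Lowercase and keep alphabetic words only; a real tokenizer does more.
        counts.update(re.findall(r"[a-z]+", line.lower()))
    missing = [(w, c) for w, c in counts.items() if w not in vocab and c >= min_count]
    # Most frequent first, ties broken alphabetically so the result is deterministic.
    missing.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in missing[:max_new]]


# Hypothetical toy data standing in for the BERT vocabulary and a medical corpus.
toy_vocab = {"the", "patient", "was", "with", "treated", "for", "and", "a"}
toy_corpus = [
    "The patient was treated for myocarditis.",
    "Myocarditis and pericarditis were ruled out with a biopsy.",
    "A second biopsy confirmed myocarditis.",
]

new_tokens = candidate_new_tokens(toy_corpus, toy_vocab)
print(new_tokens)  # ['myocarditis', 'biopsy']
```

With the Hugging Face transformers library, a list like this would typically be passed to `tokenizer.add_tokens(new_tokens)`, followed by `model.resize_token_embeddings(len(tokenizer))`. The embeddings of the added tokens start out untrained, so they only become useful after the re-training in step 3.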

Does this sound familiar? Is it possible with hugging-face?

Answer 1:

I have not heard of the pipeline you describe. To build an LM for your use case, you basically have two options:

Further train a BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as described in this recent paper. It adapts the learned parameters of the BERT model to your specific domain (biomedical text). However, in this setting you will need a fairly large corpus for BERT to update its parameters well.
Use a language model that has already been pre-trained on a large amount of domain-specific text, either from scratch or starting from the vanilla BERT weights. As you may know, the vanilla BERT model released by Google was trained on English Wikipedia and BooksCorpus. Since then, researchers have trained the BERT architecture on other domains. You may be able to use one of these pre-trained models, which have a much better grasp of domain-specific language. For your case, there are models such as BioBERT, BlueBERT, and SciBERT.
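
For the first option, "further training" means continuing BERT's masked-language-model objective on the domain corpus. The sketch below shows a simplified version of the masking step: each token is selected with probability 15%, and a selected token is replaced by `[MASK]` 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens` and the toy vocabulary are hypothetical, and real implementations also skip special tokens such as `[CLS]` and `[SEP]` and work on token ids rather than strings.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Token not selected: keep it and do not ask the model to predict it.
            inputs.append(tok)
            labels.append(None)
            continue
        # Token selected: the model must recover the original token here.
        labels.append(tok)
        r = rng.random()
        if r < 0.8:
            inputs.append(MASK)               # 80%: replace with [MASK]
        elif r < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            inputs.append(tok)                # 10%: keep the original token
    return inputs, labels


# Hypothetical toy vocabulary and sentence.
toy_vocab = ["patient", "biopsy", "acute", "treated", "myocarditis"]
sentence = "the patient was treated for acute myocarditis".split()

masked, targets = mask_tokens(sentence, toy_vocab, seed=1)
print(masked)
print(targets)
```

The model is trained to predict the original token only at positions where the label is not `None`, which is how it picks up domain vocabulary and usage from unlabeled text.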

Is it possible with hugging-face?

I am not sure whether the Hugging Face developers have a robust approach for pre-training a BERT model on custom corpora, since they state that their code is still in progress. If you want to do this step, I suggest using Google Research's BERT code, which is written in TensorFlow and is robust (it was released by BERT's authors). The exact process is described in the "Pre-training with BERT" section of its README.
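
That README asks for the pre-training input as a plain text file with one sentence per line and an empty line between documents, which is then passed to `create_pretraining_data.py` before running `run_pretraining.py`. The sketch below shows one way a custom corpus might be put into that layout. The function name `to_pretraining_text` and the sample documents are hypothetical, and the regex-based sentence splitter is deliberately naive; for medical text full of abbreviations, a proper sentence splitter would be a better choice.

```python
import re


def to_pretraining_text(documents):
    """Format documents as one sentence per line, with a blank line between documents."""
    blocks = []
    for doc in documents:
        # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", doc.strip())
        sentences = [s.strip() for s in parts if s.strip()]
        if sentences:
            blocks.append("\n".join(sentences))
    # Documents are separated by exactly one empty line.
    return "\n\n".join(blocks) + "\n" if blocks else ""


# Hypothetical sample documents standing in for the custom corpus.
docs = [
    "The patient presented with chest pain. An ECG was performed.",
    "Biopsy confirmed myocarditis. Treatment was started the same day.",
]

print(to_pretraining_text(docs))
```

The returned string can be written to a text file and used as the input corpus for the pre-training data script.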

Source: https://stackoverflow.com/questions/61416197/pretraining-a-language-model-on-a-small-custom-corpus

Tags

deep-learning

transfer-learning

huggingface-transformers

language-model

BERT