Pretraining a language model on a small custom corpus

Submitted by 烂漫一生 on 2020-07-21 07:55:47

Question


I am curious whether transfer learning can be used for text generation, by re-training (further pre-training) a model on a specific kind of text.

For example, given a pre-trained BERT model and a small corpus of medical (or any other domain-specific) text, build a language model that can generate medical text. The assumption is that you do not have a large amount of medical text, which is why you have to use transfer learning.

Putting it as a pipeline, I would describe this as:

  1. Using a pre-trained BERT tokenizer.
  2. Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
  3. Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
  4. Generating text that resembles the text within the small custom corpus.
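Steps 1 and 2 of this pipeline could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a confirmed recipe: `select_new_tokens`, `extend_model`, the whitespace-based word splitting, and the `min_count` threshold are hypothetical choices made for illustration, and `bert-base-uncased` is an assumed checkpoint name. The `transformers` calls used (`add_tokens`, `resize_token_embeddings`) exist in the library, but the embeddings of newly added tokens start out randomly initialized, so they only become useful after further training.

```python
from collections import Counter


def select_new_tokens(corpus_lines, vocab, min_count=2):
    """Return words seen at least `min_count` times in the corpus that are
    missing from the existing vocabulary, most frequent first.

    Hypothetical helper: lowercasing and whitespace splitting are only a
    rough stand-in for real tokenization.
    """
    counts = Counter(
        word.strip(".,;:!?()\"'").lower()
        for line in corpus_lines
        for word in line.split()
    )
    candidates = [
        (word, n) for word, n in counts.items()
        if word and n >= min_count and word not in vocab
    ]
    # Sort by descending frequency, then alphabetically for a stable order.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates]


def extend_model(new_tokens, model_name="bert-base-uncased"):
    """Add tokens to a pre-trained tokenizer and resize the model embeddings.

    Requires the `transformers` package and downloads the checkpoint; the
    import is local so the helper above works without the library installed.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    num_added = tokenizer.add_tokens(new_tokens)
    # The new embedding rows are randomly initialized until trained.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added


# Toy data for illustration only.
corpus = [
    "The patient received metoprolol after the angioplasty.",
    "Metoprolol was continued; a second angioplasty was not needed.",
]
toy_vocab = {"the", "patient", "received", "after", "was", "a",
             "second", "not", "needed", "continued"}
new_tokens = select_new_tokens(corpus, toy_vocab)
print(new_tokens)
```

Calling `extend_model(new_tokens)` would then give a tokenizer and model ready for step 3. With a very small corpus it may be preferable to add few or no tokens, since BERT's WordPiece tokenizer can already represent unseen words as sub-word pieces.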

Does this sound familiar? Is it possible with hugging-face?


Answer 1:


I have not heard of the pipeline you describe. To build an LM for your use case, you basically have two options:

  1. Further train a BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as described in this recent paper. It adapts the learned parameters of the BERT model to your specific domain (biomedical text). However, in this setting you will need quite a large corpus for BERT to update its parameters meaningfully.

  2. Use a language model that has already been pre-trained on a large amount of domain-specific text, either from scratch or by further training the vanilla BERT model. As you might know, the vanilla BERT model released by Google was trained on Wikipedia and BooksCorpus text. Since then, researchers have trained the BERT architecture on other domains. You may be able to use these pre-trained models, which have already learned domain-specific language. For your case, there are models such as BioBERT, BlueBERT, and SciBERT.
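Further training as in option 1 typically reuses BERT's masked language modeling objective: some input tokens are selected, and the model learns to predict the originals. The sketch below illustrates the masking rule from the BERT paper (about 15% of positions selected; of those, 80% replaced with `[MASK]`, 10% with a random token, 10% left unchanged) in plain Python. `mask_tokens` and the toy vocabulary are hypothetical names for illustration, and selecting each position independently is a simplifying assumption; in practice a library utility would do this on token IDs.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking to a list of string tokens.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at positions selected for prediction and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Toy example; mask_prob is raised so that masking is visible on a short input.
tokens = ["[CLS]", "the", "patient", "received", "metoprolol", "[SEP]"]
toy_vocab = ["the", "patient", "received", "metoprolol", "dose"]
masked, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

The training loss is computed only at positions where `labels` is not `None`, which is why a small corpus provides relatively little training signal per pass.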

Is it possible with hugging-face?

I am not sure whether the Hugging Face developers have a robust approach for pre-training a BERT model on custom corpora; they have stated that their code for this is still in progress. If you want to do this step, I suggest using Google Research's BERT code, which is written in TensorFlow and was released by BERT's authors. The exact process is described in the README, under the "Pre-training with BERT" section.
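According to that README, the pre-training data script expects a plain text input file with one sentence per line and an empty line between documents. A minimal sketch of producing such a file from a small corpus follows. `split_sentences` and `write_pretraining_input` are hypothetical helpers, and the punctuation-based splitter is only an illustration; a proper sentence segmenter would likely be a better choice for medical text with abbreviations.

```python
import os
import re
import tempfile


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def write_pretraining_input(documents, path):
    """Write documents as one sentence per line, blank line between documents."""
    with open(path, "w", encoding="utf-8") as f:
        for i, doc in enumerate(documents):
            if i > 0:
                f.write("\n")  # empty line marks a document boundary
            for sentence in split_sentences(doc):
                f.write(sentence + "\n")


# Toy documents for illustration only.
docs = [
    "The patient was admitted with chest pain. An ECG was performed.",
    "Metoprolol was prescribed. Follow-up is scheduled in two weeks.",
]
out_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_pretraining_input(docs, out_path)
with open(out_path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

The resulting file is what would be passed as the input to the repository's data-creation step before running pre-training.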



Source: https://stackoverflow.com/questions/61416197/pretraining-a-language-model-on-a-small-custom-corpus
