How to use Bert for long text classification?


Question


We know that BERT has a maximum length limit of 512 tokens. So if an article is much longer than that, say 10,000 tokens, how can BERT be used?


Answer 1:


You have basically three options:

  1. You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
  2. You can split your text into multiple subtexts, classify each of them, and combine the results back together (for example, choose the class that was predicted for most of the subtexts). This option is obviously more expensive.
  3. You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune) as described in this discussion.

I would suggest trying option 1 first, and considering the other options only if it is not good enough.
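For illustration, here is a minimal sketch of options 1 and 2 with the Hugging Face transformers library. The bert-base-uncased checkpoint, the two-label head, the stride of 64, and the majority vote are my own illustrative assumptions, not part of the original answer; in practice you would use your own fine-tuned classifier.

```python
# Sketch of option 1 (truncate) and option 2 (chunk, classify, vote).
# Assumes an already fine-tuned classifier; bert-base-uncased + num_labels=2
# are placeholders for your own checkpoint and label count.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

long_text = "some document with many thousands of tokens ..."  # placeholder

# Option 1: keep only the first 512 tokens (the tokenizer truncates for you).
enc = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    pred_truncated = model(**enc).logits.argmax(-1).item()

# Option 2: split into overlapping 512-token chunks and take a majority vote.
chunks = tokenizer(long_text, truncation=True, max_length=512, stride=64,
                   return_overflowing_tokens=True, padding=True, return_tensors="pt")
chunks.pop("overflow_to_sample_mapping")  # bookkeeping field, not a model input
with torch.no_grad():
    votes = model(**chunks).logits.argmax(-1).tolist()
pred_voted = max(set(votes), key=votes.count)
```

Instead of voting on predicted classes, you could also average the logits over chunks; both are common ways to recombine the per-chunk results.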




Answer 2:


This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification? On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches that break the article into chunks and then recombine the results.
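For reference, a small sketch of that "drop the middle" idea at the token level. The 128-head/382-tail split mirrors the head+tail setup described in that paper, but the checkpoint name and the helper function are just illustrative assumptions.

```python
# Sketch of "keep the head and tail, cut the middle" truncation.
# 128 + 382 tokens plus [CLS] and [SEP] fits exactly into BERT's 512 limit;
# tune these numbers for your own data.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def head_plus_tail(text, head=128, tail=382):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) > head + tail:
        ids = ids[:head] + ids[-tail:]          # drop the middle tokens
    ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    return {"input_ids": torch.tensor([ids]),
            "attention_mask": torch.ones(1, len(ids), dtype=torch.long)}
```

The returned dict can be fed directly to a BertForSequenceClassification model.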

As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)




Answer 3:


In addition to chunking the data and passing it to BERT, check out the following new approaches.

There is new research on long-document analysis. Since you asked about BERT, a similar pre-trained transformer, Longformer, has recently been made available by AllenNLP (https://arxiv.org/abs/2004.05150). Check out that link for the paper.

The related-work section also mentions some previous work on long sequences; Google those too. I would suggest at least going through Transformer-XL (https://arxiv.org/abs/1901.02860). As far as I know, it was one of the earliest models for long sequences, so it is a good foundation before moving on to Longformer.




Answer 4:


There is an approach used in the paper Defending Against Neural Fake News (https://arxiv.org/abs/1905.12616).

Their generative model was producing outputs of 1024 tokens, and they wanted to use BERT to distinguish human from machine generations. They extended the sequence length that BERT uses simply by initializing 512 more position embeddings and training them while fine-tuning BERT on their dataset.
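A rough Hugging Face adaptation of that trick is sketched below. This is not the authors' code; copying the pretrained 512 rows into the new positions is my own initialization choice (the paper simply trains the new embeddings during fine-tuning), and the checkpoint and label count are placeholders.

```python
# Sketch: grow BERT's learned position embeddings from 512 to 1024 so longer
# inputs fit, then fine-tune as usual. The copy-based initialization of the
# new rows is an assumption, not the paper's exact recipe.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

old = model.bert.embeddings.position_embeddings            # nn.Embedding(512, hidden)
new = torch.nn.Embedding(1024, old.embedding_dim)
with torch.no_grad():
    new.weight[:512] = old.weight                           # keep the trained positions
    new.weight[512:] = old.weight                           # seed the extra 512 positions
model.bert.embeddings.position_embeddings = new
model.config.max_position_embeddings = 1024

# The embeddings module caches position ids (and token type ids) as buffers,
# so extend those too before feeding sequences longer than 512 tokens.
model.bert.embeddings.register_buffer(
    "position_ids", torch.arange(1024).unsqueeze(0), persistent=False)
model.bert.embeddings.register_buffer(
    "token_type_ids", torch.zeros(1, 1024, dtype=torch.long), persistent=False)
```

When tokenizing, pass max_length=1024 explicitly, since the stock BERT tokenizer still defaults to a 512-token limit.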




Answer 5:


You can leverage the Hugging Face Transformers library, which includes the following Transformers that work with long texts (more than 512 tokens):

  • Reformer: combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences.
  • Longformer: with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

Other recently proposed efficient Transformer models include Sparse Transformers (Child et al., 2019), Linformer (Wang et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al., 2020a), Linear Transformers (Katharopoulos et al., 2020), and BigBird (Zaheer et al., 2020).

A paper by authors from Google Research and DeepMind compares these Transformers based on the Long-Range Arena "aggregated metrics".

They also suggest that Longformers have better performance than Reformer when it comes to the classification task.
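As a concrete starting point, loading Longformer for classification looks roughly like the sketch below. The allenai/longformer-base-4096 checkpoint and the two-label head are examples of my own choosing, and the classification head is freshly initialized, so it still has to be fine-tuned on your data.

```python
# Sketch: classify a long document (up to 4096 tokens) with Longformer.
# The checkpoint and num_labels are examples; fine-tune before relying
# on the predictions, since the classification head starts untrained.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)
model.eval()

long_document = "a document with several thousand tokens ..."  # placeholder
enc = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(logits.argmax(-1).item())
```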




Answer 6:


There are two main methods:

  • Concatenating 'short' BERT chunks (each limited to 512 tokens)
  • Constructing a real long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)

I summarized some typical papers on BERT for long text in this post: https://lethienhoablog.wordpress.com/2020/11/19/paper-dissected-and-recap-4-which-bert-for-long-text/

You can have an overview of all methods there.



Source: https://stackoverflow.com/questions/58636587/how-to-use-bert-for-long-text-classification
