Sliding window for long text in BERT for Question Answering

浪尽此生 提交于 2021-01-05 00:27:26

问题


I've read post which explains how the sliding window works but I cannot find any information on how it is actually implemented.

From what I understand if the input are too long, sliding window can be used to process the text.

Please correct me if I am wrong. Say I have a text "In June 2017 Kaggle announced that it passed 1 million registered users".

Given some stride and max_len, the input can be split into chunks with over lapping words (not considering padding).

In June 2017 Kaggle announced that # chunk 1
announced that it passed 1 million # chunk 2
1 million registered users # chunk 3

If my questions were "when did Kaggle make the announcement" and "how many registered users" I can use chunk 1 and chunk 3 and not use chunk 2 at all in the model. Not quiet sure if I should still use chunk 2 to train the model

So the input will be: [CLS]when did Kaggle make the announcement[SEP]In June 2017 Kaggle announced that[SEP] and [CLS]how many registered users[SEP]1 million registered users[SEP]


Then if I have a question with no answers do I feed it into the model with all chunks like and indicate the starting and ending index as -1? For example "can pigs fly?"

[CLS]can pigs fly[SEP]In June 2017 Kaggle announced that[SEP]

[CLS]can pigs fly[SEP]announced that it passed 1 million[SEP]

[CLS]can pigs fly[SEP]1 million registered users[SEP]


As suggested in the comments, II tried to run squad_convert_example_to_features (source code) to investigate the problem I have above, but it doesn't seem to work, nor there are any documentation. It seems like run_squad.py from huggingface uses squad_convert_example_to_features with the s in example.

from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor, squad_convert_example_to_features
from transformers import AutoTokenizer, AutoConfig, squad_convert_examples_to_features

FILE_DIR = "."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = SquadV2Processor()
examples = processor.get_train_examples(FILE_DIR)

features = squad_convert_example_to_features(
    example=examples[0],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)

I get the error.

100%|██████████| 1/1 [00:00<00:00, 159.95it/s]
Traceback (most recent call last):
  File "<input>", line 25, in <module>
    sub_tokens = tokenizer.tokenize(token)
NameError: name 'tokenizer' is not defined

The error indicates that there are no tokenizers but it does not allow us to pass a tokenizer. Though it does work if I add a tokenizer while I am inside the function in debug mode. So how exactly do I use the squad_convert_example_to_features function?


回答1:


I think there is a problem with the examples you pick. Both squad_convert_examples_to_features and squad_convert_example_to_features have a sliding window approach implemented because squad_convert_examples_to_features is just a parallelization wrapper for squad_convert_example_to_features. But let's look at the single example function. First of all you need to call squad_convert_example_to_features_init to make the tokenizer global (this is done automatically for you in squad_convert_examples_to_features):

from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features, squad_convert_example_to_features_init
from transformers import AutoTokenizer, AutoConfig, squad_convert_examples_to_features

FILE_DIR = "."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
squad_convert_example_to_features_init(tokenizer)

processor = SquadV2Processor()
examples = processor.get_train_examples(FILE_DIR)

features = squad_convert_example_to_features(
    example=examples[0],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))

Output:

1

You might say that this function is not using a sliding window approach, but this is wrong because your example doesn't needed to be split:

print(len(examples[0].question_text.split()) + len(examples[0].doc_tokens))

Output:

115

which is less as the max_seq_length which you have set to 384. Now lets try a different one:

print(len(examples[129603].question_text.split()) + len(examples[129603].doc_tokens))

features = squad_convert_example_to_features(
    example=examples[129603],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))

Output:

454
3

Which you can now compare with the original sample:

print('[CLS]' + examples[129603].question_text + '[SEP]' + ' '.join(examples[129603].doc_tokens) + '[SEP]')

for idx, f in enumerate(features):
    print('Split {}'.format(idx))
    print(' '.join(f.tokens))

Output:

[CLS]How often is hunting occurring in Delaware each year?[SEP]There is a very active tradition of hunting of small to medium-sized wild game in Trinidad and Tobago. Hunting is carried out with firearms, and aided by the use of hounds, with the illegal use of trap guns, trap cages and snare nets. With approximately 12,000 sport hunters applying for hunting licences in recent years (in a very small country of about the size of the state of Delaware at about 5128 square kilometers and 1.3 million inhabitants), there is some concern that the practice might not be sustainable. In addition there are at present no bag limits and the open season is comparatively very long (5 months - October to February inclusive). As such hunting pressure from legal hunters is very high. Added to that, there is a thriving and very lucrative black market for poached wild game (sold and enthusiastically purchased as expensive luxury delicacies) and the numbers of commercial poachers in operation is unknown but presumed to be fairly high. As a result, the populations of the five major mammalian game species (red-rumped agouti, lowland paca, nine-banded armadillo, collared peccary, and red brocket deer) are thought to be quite low (although scientifically conducted population studies are only just recently being conducted as of 2013). It appears that the red brocket deer population has been extirpated on Tobago as a result of over-hunting. Various herons, ducks, doves, the green iguana, the gold tegu, the spectacled caiman and the common opossum are also commonly hunted and poached. There is also some poaching of 'fully protected species', including red howler monkeys and capuchin monkeys, southern tamanduas, Brazilian porcupines, yellow-footed tortoises, Trinidad piping guans and even one of the national birds, the scarlet ibis. Legal hunters pay very small fees to obtain hunting licences and undergo no official basic conservation biology or hunting-ethics training. There is presumed to be relatively very little subsistence hunting in the country (with most hunting for either sport or commercial profit). The local wildlife management authority is under-staffed and under-funded, and as such very little in the way of enforcement is done to uphold existing wildlife management laws, with hunting occurring both in and out of season, and even in wildlife sanctuaries. There is some indication that the government is beginning to take the issue of wildlife management more seriously, with well drafted legislation being brought before Parliament in 2015. It remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments, and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of wanton consumption to one of sustainable management.[SEP]
Split 0
[CLS] how often is hunting occurring in delaware each year ? [SEP] there is a very active tradition of hunting of small to medium - sized wild game in trinidad and tobago . hunting is carried out with firearms , and aided by the use of hounds , with the illegal use of trap guns , trap cages and s ##nare nets . with approximately 12 , 000 sport hunters applying for hunting licence ##s in recent years ( in a very small country of about the size of the state of delaware at about 512 ##8 square kilometers and 1 . 3 million inhabitants ) , there is some concern that the practice might not be sustainable . in addition there are at present no bag limits and the open season is comparatively very long ( 5 months - october to february inclusive ) . as such hunting pressure from legal hunters is very high . added to that , there is a thriving and very lucrative black market for po ##ache ##d wild game ( sold and enthusiastically purchased as expensive luxury del ##ica ##cies ) and the numbers of commercial po ##ache ##rs in operation is unknown but presumed to be fairly high . as a result , the populations of the five major mammalian game species ( red - rum ##ped ago ##uti , lowland pac ##a , nine - banded arm ##adi ##llo , collar ##ed pe ##cca ##ry , and red brock ##et deer ) are thought to be quite low ( although scientific ##ally conducted population studies are only just recently being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , [SEP]
Split 1
[CLS] how often is hunting occurring in delaware each year ? [SEP] october to february inclusive ) . as such hunting pressure from legal hunters is very high . added to that , there is a thriving and very lucrative black market for po ##ache ##d wild game ( sold and enthusiastically purchased as expensive luxury del ##ica ##cies ) and the numbers of commercial po ##ache ##rs in operation is unknown but presumed to be fairly high . as a result , the populations of the five major mammalian game species ( red - rum ##ped ago ##uti , lowland pac ##a , nine - banded arm ##adi ##llo , collar ##ed pe ##cca ##ry , and red brock ##et deer ) are thought to be quite low ( although scientific ##ally conducted population studies are only just recently being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , trinidad pip ##ing gu ##ans and even one of the national birds , the scarlet ib ##is . legal hunters pay very small fees to obtain hunting licence ##s and undergo no official basic conservation biology or hunting - ethics training . there is presumed to be relatively very little subsistence hunting in the country ( with most hunting for either sport or commercial profit ) . the local wildlife management authority is under - staffed and under - funded , and as such very little in the way of enforcement is done to uphold existing wildlife management laws , with hunting occurring both in and out of season , and even in wildlife san ##ct ##uaries . there is some indication that the government is beginning to [SEP]
Split 2
[CLS] how often is hunting occurring in delaware each year ? [SEP] being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , trinidad pip ##ing gu ##ans and even one of the national birds , the scarlet ib ##is . legal hunters pay very small fees to obtain hunting licence ##s and undergo no official basic conservation biology or hunting - ethics training . there is presumed to be relatively very little subsistence hunting in the country ( with most hunting for either sport or commercial profit ) . the local wildlife management authority is under - staffed and under - funded , and as such very little in the way of enforcement is done to uphold existing wildlife management laws , with hunting occurring both in and out of season , and even in wildlife san ##ct ##uaries . there is some indication that the government is beginning to take the issue of wildlife management more seriously , with well drafted legislation being brought before parliament in 2015 . it remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments , and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of want ##on consumption to one of sustainable management . [SEP]

If my questions were "when did Kaggle make the announcement" and "how many registered users" I can use chunk 1 and chunk 3 and not use chunk 2 at all in the model. Not quiet sure if I should still use chunk 2 to train the model

Yes you should also use chunk 2 to train your model, because when you try to predict the same sequence you hope that your model predicts 0:0 as answer span for chunk 2 (i.e. you can easily select the chunk which contains the answer).



来源:https://stackoverflow.com/questions/62978957/sliding-window-for-long-text-in-bert-for-question-answering

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!