Custom sentence boundary detection in SpaCy

Submitted by 让人想犯罪 on 2021-02-08 01:50:57

Question


I'm trying to write a custom sentence segmenter in spaCy that returns the whole document as a single sentence.

I wrote a custom pipeline component that does it using the code from here.

I can't get it to work, though: instead of changing the sentence boundaries so that the whole document is a single sentence, it throws one of two different errors.

If I create a blank language instance and only add my custom component to the pipeline I get this error:

ValueError: Sentence boundary detection requires the dependency parse, which requires a statistical model to be installed and loaded.

If I add the parser component to the pipeline:

import spacy

nlp = spacy.blank('es')
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, last=True)

def custom_sbd(doc):
    print("EXECUTING SBD!!!!!!!!!!!!!!!!!!!!")
    # Mark only the first token as a sentence start
    doc[0].sent_start = True
    for i in range(1, len(doc)):
        doc[i].sent_start = False
    return doc

nlp.begin_training()
nlp.add_pipe(custom_sbd, first=True)

I get the same error.

If I change the order so that it parses first and then changes the sentence boundaries, the error changes to:

Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

So I get an error requiring the dependency parse when the parser is absent or runs after my custom sentence boundary detection, and a different error when the parser runs first. What's the appropriate way to do it?

Thank you!


Answer 1:


Ines from spaCy answered my question here

Thanks for bringing this up – and sorry this is a little confusing. I'm pretty sure the first problem you describe is already fixed on master. spaCy should definitely respect custom sentence boundaries, even in pipelines with no dependency parser.

If you want to use your custom SBD component without a parser, a very simple solution would be to set doc.is_parsed = True in your custom component. So when Doc.sents checks for the dependency parse, it looks at is_parsed and won't complain.
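As a rough sketch of the parser-free setup described above: the component below marks only the first token as a sentence start and, on spaCy v2, also sets doc.is_parsed as suggested. The names whole_doc_sentence and SPACY_V2 are made up for illustration, and the version check is my own assumption: doc.is_parsed is a v2-era flag that spaCy v3 deprecates in favor of Doc.has_annotation, and v3 expects components to be registered under a string name.

```python
import spacy

# Assumption: branch on the major version, since the v2 and v3 pipeline APIs differ.
SPACY_V2 = spacy.__version__.startswith("2.")


def whole_doc_sentence(doc):
    """Mark the first token as the only sentence start in the document."""
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    if SPACY_V2:
        # The flag suggested in the answer, so Doc.sents does not ask for a parse.
        doc.is_parsed = True
    return doc


nlp = spacy.blank("es")
if SPACY_V2:
    # v2 accepts the callable directly.
    nlp.add_pipe(whole_doc_sentence, first=True)
else:
    # v3 wants the function registered under a name first.
    from spacy.language import Language

    Language.component("whole_doc_sentence", func=whole_doc_sentence)
    nlp.add_pipe("whole_doc_sentence", first=True)

doc = nlp("Hola. Esto es una prueba. Adiós.")
sentences = list(doc.sents)
print(len(sentences), repr(sentences[0].text))
```

The boundaries are written before is_parsed is set, because spaCy refuses to write sentence starts on a document it considers parsed.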

If you want to use your component with the parser, make sure to add it before the parser. The parser should always respect already set sentence boundaries from previous processing steps.
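The ordering advice can be sketched like this. It only builds the pipeline and checks the component order; the parser here is untrained, so no text is processed. In practice you would load a trained pipeline instead of a blank one. The component name whole_doc_sentence_first and the SPACY_V2 check are illustrative assumptions, not part of the original answer.

```python
import spacy

# Assumption: branch on the major version, since the v2 and v3 pipeline APIs differ.
SPACY_V2 = spacy.__version__.startswith("2.")


def whole_doc_sentence_first(doc):
    """Set sentence boundaries before the parser sees the document."""
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    return doc


nlp = spacy.blank("es")
if SPACY_V2:
    nlp.add_pipe(nlp.create_pipe("parser"), name="parser")
    nlp.add_pipe(
        whole_doc_sentence_first,
        name="whole_doc_sentence_first",
        before="parser",
    )
else:
    from spacy.language import Language

    Language.component("whole_doc_sentence_first", func=whole_doc_sentence_first)
    nlp.add_pipe("parser")
    nlp.add_pipe("whole_doc_sentence_first", before="parser")

# The custom component should be listed ahead of the parser.
print(nlp.pipe_names)
```

Using before="parser" rather than first=True keeps the component directly ahead of the parser even if other components are added at the front of the pipeline later.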



Source: https://stackoverflow.com/questions/48443624/custom-sentence-boundary-detection-in-spacy
