spacy sentence tokenization error on Hebrew


Question


Trying to use spacy sentence tokenization for Hebrew.

import spacy

nlp = spacy.load('he')
doc = nlp(text)  # `text` is a Hebrew string defined elsewhere
sents = list(doc.sents)

I get:

    Warning: no model found for 'he'

    Only loading the 'he' tokenizer.

    Traceback (most recent call last):
      ...
        sents = list(doc.sents)
      File "spacy/tokens/doc.pyx", line 438, in __get__ (spacy/tokens/doc.cpp:9707)
        raise ValueError(
    ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation:
    https://spacy.io/docs/usage

What should I do?


Answer 1:


spaCy's Hebrew coverage is currently quite minimal: it only has word tokenization for Hebrew, which roughly splits on whitespace with some extra rules and exceptions. The sentence tokenization/boundary detection that you want requires a more sophisticated grammatical parse of the text in order to determine where one sentence ends and the next begins. Such models require a large amount of labeled training data, so they are available for fewer languages than plain tokenization is (the spaCy documentation lists the supported models).
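To illustrate what is available, here is a minimal sketch, assuming spaCy v3+ (where `spacy.blank` creates a tokenizer-only pipeline; the Hebrew sample string is hypothetical):

import spacy

# Tokenizer-only pipeline: spacy.blank('he') needs no trained model.
nlp = spacy.blank("he")
doc = nlp("שלום עולם")  # hypothetical sample, "hello world"
print([token.text for token in doc])  # -> ['שלום', 'עולם']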

The initial message is telling you that spaCy can still tokenize, which doesn't require a model; the error that follows is the result of not having a model to split sentences, run NER or POS tagging, and so on.
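If rough, punctuation-based sentence splits are enough for your use case, one workaround is spaCy's rule-based sentencizer component, which needs no trained model. A minimal sketch, again assuming spaCy v3+ (the Hebrew sample text is made up):

import spacy

# Rule-based fallback: the sentencizer splits on punctuation,
# so no dependency parse (and no trained model) is required.
nlp = spacy.blank("he")
nlp.add_pipe("sentencizer")

doc = nlp("זה משפט ראשון. זה משפט שני.")  # hypothetical sample text
for sent in doc.sents:
    print(sent.text)

Because the sentencizer only looks at punctuation, it will mis-split around abbreviations and fail on unpunctuated text, but it sidesteps the dependency-parse requirement entirely.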

You might look at curated lists of Hebrew NLP resources for other options. If you find enough labeled data in the right format and you're feeling ambitious, you could train your own Hebrew spaCy model following the training overview in the spaCy documentation.



Source: https://stackoverflow.com/questions/48572541/spacy-sentence-tokenization-error-on-hebrew
