spacy sentence tokenization error on Hebrew


Question


Trying to use spacy sentence tokenization for Hebrew.

import spacy

nlp = spacy.load('he')
doc = nlp(text)  # `text` is a Hebrew string defined elsewhere
sents = list(doc.sents)

I get:

    Warning: no model found for 'he'

    Only loading the 'he' tokenizer.

    Traceback (most recent call last):
      ...
        sents = list(doc.sents)
      File "spacy/tokens/doc.pyx", line 438, in __get__ (spacy/tokens/doc.cpp:9707)
        raise ValueError(
    ValueError: Sentence boundary detection requires the dependency parse, which requires data to be installed. For more info, see the documentation:
    https://spacy.io/docs/usage

What should I do?


Answer 1:


spaCy's Hebrew coverage is currently quite minimal: it only has word tokenization for Hebrew, which roughly splits on whitespace with some extra rules and exceptions. The sentence tokenization/boundary detection that you want requires a more sophisticated grammatical parse of the text in order to determine where one sentence ends and the next begins. Such models require a large amount of labeled training data, so they are available for fewer languages than plain tokenization is (the spaCy documentation lists the supported models).
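To illustrate what is available, here is a minimal sketch, assuming spaCy v3+ (where `spacy.blank` creates a tokenizer-only pipeline; the Hebrew sample string is hypothetical):

import spacy

# Tokenizer-only pipeline: spacy.blank('he') needs no trained model.
nlp = spacy.blank("he")
doc = nlp("שלום עולם")  # hypothetical sample, "hello world"
print([token.text for token in doc])  # -> ['שלום', 'עולם']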

The initial message is telling you that spaCy can still tokenize, which doesn't require a model; the error that follows is the result of not having a model to split sentences, run NER or POS tagging, and so on.
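If rough, punctuation-based sentence splits are enough for your use case, one workaround is spaCy's rule-based sentencizer component, which needs no trained model. A minimal sketch, again assuming spaCy v3+ (the Hebrew sample text is made up):

import spacy

# Rule-based fallback: the sentencizer splits on punctuation,
# so no dependency parse (and no trained model) is required.
nlp = spacy.blank("he")
nlp.add_pipe("sentencizer")

doc = nlp("זה משפט ראשון. זה משפט שני.")  # hypothetical sample text
for sent in doc.sents:
    print(sent.text)

Because the sentencizer only looks at punctuation, it will mis-split around abbreviations and fail on unpunctuated text, but it sidesteps the dependency-parse requirement entirely.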

You might look at curated lists of Hebrew NLP resources for other options. If you find enough labeled data in the right format and you're feeling ambitious, you could train your own Hebrew spaCy model following the training overview in the spaCy documentation.



Source: https://stackoverflow.com/questions/48572541/spacy-sentence-tokenization-error-on-hebrew
