Question
spaCy's POS tagger is really convenient; it can tag a raw sentence directly:
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")
But I'm using the tokenizer from nltk. So how can I use a tokenized sentence like ['I', 'am', 'eating'] rather than 'I am eating' as input for spaCy's tagger?
BTW, where can I find detailed spaCy documentation? I can only find an overview on the official website.
Thanks.
Answer 1:
There are two options:

1. You write a wrapper around the nltk tokenizer and use it to convert text to spaCy's Doc format. Then overwrite nlp.tokenizer with that new custom function. More info here: https://spacy.io/usage/linguistic-features#custom-tokenizer (a sketch of this approach is at the end of this answer).

2. Generate a Doc directly from a list of strings, like so (a fuller sketch follows below):

doc = Doc(nlp.vocab, words=[u"I", u"am", u"eating", u"."], spaces=[True, True, False, False])
Defining the spaces is optional - if you leave it out, each word will be followed by a space by default. This matters when using e.g. doc.text afterwards. More information here: https://spacy.io/usage/linguistic-features#own-annotations
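Putting option 2 together, here is a minimal runnable sketch (assuming spaCy v2.x and the en_core_web_sm model, neither of which the question fixes): it builds a Doc from your own tokens and then calls the loaded pipeline components on it, since nlp(text) only accepts a raw string and would re-tokenize.

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# Build a Doc from tokens produced elsewhere (e.g. by nltk).
doc = Doc(nlp.vocab, words=[u'I', u'am', u'eating'])

# Apply the loaded pipeline components (tagger, parser, ner) to the Doc.
for name, component in nlp.pipeline:
    doc = component(doc)

for token in doc:
    print(token.text, token.pos_, token.tag_)

Calling the components yourself is what lets you skip spaCy's own tokenization while still getting its POS tags.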
[edit]: note that nlp and doc are sort of 'standard' variable names in spaCy; they correspond to the variables sp and sen respectively in your code.
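For completeness, here is a minimal sketch of option 1 (again assuming spaCy v2.x; the wrapper class name NLTKTokenizer and the choice of nltk's word_tokenize are illustrative, not prescribed by the question): the wrapper tokenizes with nltk, returns a Doc, and is plugged in as nlp.tokenizer so the rest of the pipeline runs unchanged.

import spacy
from spacy.tokens import Doc
from nltk.tokenize import word_tokenize

class NLTKTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Tokenize with nltk, then wrap the tokens in a spaCy Doc.
        words = word_tokenize(text)
        return Doc(self.vocab, words=words)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = NLTKTokenizer(nlp.vocab)  # overwrite the default tokenizer

# The tagger, parser and ner now run on nltk's tokens.
doc = nlp(u"I am eating")
for token in doc:
    print(token.text, token.pos_)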
Source: https://stackoverflow.com/questions/56437945/how-to-use-tokenized-sentence-as-input-for-spacys-pos-tagger