How to use tokenized sentence as input for Spacy's PoS tagger?

Submitted by 不想你离开 on 2021-01-28 11:00:23

Question


spaCy's POS tagger is really convenient: it can tag a raw sentence directly.

import spacy  
sp = spacy.load('en_core_web_sm')  
sen = sp(u"I am eating")  

But I'm using the tokenizer from NLTK. So how can I use an already tokenized sentence like ['I', 'am', 'eating'] rather than the raw string 'I am eating' as input for spaCy's tagger?

BTW, where can I find detailed spaCy documentation? I can only find an overview on the official website.

Thanks.


Answer 1:


There are two options:

  1. You write a wrapper around the nltk tokenizer and use it to convert text to spaCy's Doc format. Then overwrite nlp.tokenizer with that new custom function. More info here: https://spacy.io/usage/linguistic-features#custom-tokenizer.

  2. Generate a Doc directly from a list of strings, like so:

    from spacy.tokens import Doc
    doc = Doc(nlp.vocab, words=[u"I", u"am", u"eating", u"."], spaces=[True, True, False, False])

    Defining the spaces is optional - if you leave it out, each word will be followed by a space by default. This matters when you use e.g. doc.text afterwards. More information here: https://spacy.io/usage/linguistic-features#own-annotations

[edit]: note that nlp and doc are somewhat 'standard' variable names in spaCy; they correspond to the variables sp and sen, respectively, in your code.



Source: https://stackoverflow.com/questions/56437945/how-to-use-tokenized-sentence-as-input-for-spacys-pos-tagger
