Is it possible to use spaCy with already tokenized input?


You can do this by replacing spaCy's default tokenizer with your own:

nlp.tokenizer = custom_tokenizer

Where custom_tokenizer is a function taking raw text as input and returning a Doc object.
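For example, here is a minimal sketch that wires everything together, assuming your tokens are just the whitespace-separated pieces of the text (the en_core_web_sm model and the whitespace splitting are placeholders; substitute your own):

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(text):
    # placeholder logic: split on whitespace; use your own tokens here
    return Doc(nlp.vocab, words=text.split())

nlp.tokenizer = custom_tokenizer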

You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:

from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []

    # your existing code to fill the list with tokens

    # instead of returning the plain list, wrap it in a Doc:
    return Doc(nlp.vocab, words=tokens)

See the documentation on Doc (https://spacy.io/api/doc).
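One detail worth knowing: Doc also takes a spaces argument, a list of booleans saying whether each token is followed by a space. If you omit it, spaCy assumes a space after every token, so doc.text may not reproduce your original string exactly. A small sketch:

from spacy.tokens import Doc

words = ['Hello', ',', 'world', '.']
spaces = [False, True, False, False]  # True = token is followed by a space
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # prints: Hello, world.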

If for some reason you cannot do this (maybe you don't have access to the tokenization function), you can use a dictionary that maps each raw text to its list of tokens:

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')

Either way, you can then use the pipeline as in your first example:

doc = nlp('Hello, world.')
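If the replacement worked, the rest of the pipeline (tagger, parser, etc.) runs over your tokens as usual. A quick sanity check, assuming an English model is loaded:

for token in doc:
    print(token.text, token.pos_, token.dep_)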