How to force a pos tag in spacy before/after tagger?

给你一囗甜甜゛ submitted on 2021-01-21 10:45:06

Question


If I process the sentence

'Return target card to your hand'

with spaCy and the en_core_web_lg model, it tags the tokens as follows:

Return NOUN target NOUN card NOUN to ADP your ADJ hand NOUN

How can I force 'Return' to be tagged as a VERB? And how can I do it before the parser, so that the parser can better interpret relations between tokens?

There are other situations in which this would be useful. I am dealing with text that contains specific symbols such as {G}. These three characters should be treated as a single NOUN, and {T} should be a VERB. Right now I do not know how to achieve that without developing a new model for tokenizing and tagging. If I could "force" a token, I could replace these symbols with something that would be recognized as one token and force it to be tagged appropriately. For example, I could replace {G} with SYMBOLG and force SYMBOLG to be tagged as NOUN.
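The placeholder substitution described above could be sketched roughly as follows. This uses only the standard library; the mapping `PLACEHOLDERS` and the helper names `replace_symbols` and `restore_symbols` are hypothetical, and SYMBOLT is an assumed placeholder chosen by analogy with SYMBOLG.

```python
import re

# Hypothetical mapping from symbols to single-word placeholders.
# "SYMBOLG" comes from the question; "SYMBOLT" is assumed by analogy.
PLACEHOLDERS = {"{G}": "SYMBOLG", "{T}": "SYMBOLT"}
REVERSE = {word: symbol for symbol, word in PLACEHOLDERS.items()}

# One regex that matches any known symbol literally.
_SYMBOL_RE = re.compile("|".join(re.escape(s) for s in PLACEHOLDERS))
# One regex that matches any placeholder as a whole word.
_WORD_RE = re.compile(r"\b(" + "|".join(re.escape(w) for w in REVERSE) + r")\b")


def replace_symbols(text):
    """Swap each known symbol for its placeholder word."""
    return _SYMBOL_RE.sub(lambda m: PLACEHOLDERS[m.group(0)], text)


def restore_symbols(text):
    """Swap each placeholder word back to the original symbol."""
    return _WORD_RE.sub(lambda m: REVERSE[m.group(1)], text)


print(replace_symbols("{T}: Add {G} to your mana pool"))
```

The placeholders only solve the tokenization half of the problem: the tagger still has to be told how to tag SYMBOLG and SYMBOLT.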


Answer 1:


EDIT: this solution used spaCy 2.0.12 (IIRC).

To answer the second part of your question: you can add special-case tokenization rules to the tokenizer, as described in the spaCy documentation. The following code should do what you want, assuming those symbols are unambiguous:

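import spacy

from spacy.symbols import ORTH, POS, NOUN, VERB

# 'en' is the spaCy 2.x shortcut name for the default English model.
nlp = spacy.load('en')

# Keep each symbol as a single token and attach a fixed part-of-speech to it.
nlp.tokenizer.add_special_case('{G}', [{ORTH: '{G}', POS: NOUN}])
nlp.tokenizer.add_special_case('{T}', [{ORTH: '{T}', POS: VERB}])

doc = nlp('This {G} a noun and this is a {T}')

# Print each token next to its coarse-grained POS tag.
for token in doc:
    print('{:10}{:10}'.format(token.text, token.pos_))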

The output is shown below. Not every tag is correct, but it shows that the special-case rules have been applied:

This      DET       
{G}       NOUN      
a         DET       
noun      NOUN      
and       CCONJ     
this      DET       
is        VERB      
a         DET       
{T}       VERB      

As for the first part of your question: the problem with assigning a part of speech to an individual word is that most words are ambiguous out of context (is "return" a noun or a verb?). The method above cannot take context into account, so it is likely to introduce errors. spaCy does support token-based pattern matching, however, which is worth a look. There may be a way to do what you're after with it.
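For readers on spaCy 3.x, where (as far as I know) tokenizer special cases no longer accept a POS attribute, one possible way to combine both ideas is the `attribute_ruler` component, which applies token-pattern matches as attribute overrides. The sketch below is an illustration rather than the method from the answer above. It uses a blank English pipeline so that it runs without downloading a model, and the "return" followed by "target" pattern is an assumed example of a context rule, not something taken from the question's data.

```python
import spacy
from spacy.symbols import ORTH

# A blank English pipeline keeps the sketch self-contained (no trained model).
nlp = spacy.blank("en")

# Keep "{G}" and "{T}" together as single tokens.
for symbol in ("{G}", "{T}"):
    nlp.tokenizer.add_special_case(symbol, [{ORTH: symbol}])

# The attribute ruler overrides token attributes wherever a pattern matches.
ruler = nlp.add_pipe("attribute_ruler")

# Context-free overrides for the unambiguous symbols.
ruler.add(patterns=[[{"ORTH": "{G}"}]], attrs={"POS": "NOUN"})
ruler.add(patterns=[[{"ORTH": "{T}"}]], attrs={"POS": "VERB"})

# Context-sensitive override (assumed rule): "return" directly followed by
# "target" is a verb. index=0 applies attrs to the first token of the match.
ruler.add(
    patterns=[[{"LOWER": "return"}, {"LOWER": "target"}]],
    attrs={"POS": "VERB"},
    index=0,
)

doc = nlp("Return target card to your hand")
print([(token.text, token.pos_) for token in doc])
```

In a trained pipeline the ruler would need to run after the tagger so that its overrides are not overwritten. Whether forcing a tag actually changes the parser's output depends on the pipeline, so that part should be checked rather than assumed.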



Source: https://stackoverflow.com/questions/51766157/how-to-force-a-pos-tag-in-spacy-before-after-tagger
