POS tagging - NLTK thinks noun is adjective

Submitted by 时间秒杀一切 on 2019-12-29 07:52:17

Question


In the following code, why does nltk think 'fish' is an adjective and not a noun?

>>> import nltk
>>> s = "a woman needs a man like a fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]

Answer 1:


I am not sure what the workaround is, but you can check the tagger source here: https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/

Meanwhile, I tried a slightly different approach and split your sentence in two:

>>> s = "a woman needs a man. A fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man.', 'NP'), ('A', 'NNP'), ('fish', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('bicycle', 'NN')]

With this input, fish is tagged as "NN".




Answer 2:


If you first ran a lookup tagger, as described in chapter 5 of the NLTK book (for example, with WordNet as the lookup reference), your tagger would already "know" that fish cannot be an adjective. For words with several possible POS tags, you could then use a statistical tagger as the backoff tagger.
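The lookup-then-backoff idea can be sketched without any NLTK dependency, so the control flow is easy to see. Everything named here is an assumption for illustration: `LOOKUP` is a tiny hand-written table (the NLTK book builds its table from corpus frequencies instead), and `backoff_guess` is a toy stand-in for a real statistical tagger. In NLTK itself, the same pattern is usually built by passing a `model` dictionary and a `backoff` tagger to `UnigramTagger`.

```python
# Sketch of "lookup first, backoff second". Not NLTK code:
# LOOKUP and backoff_guess are hypothetical stand-ins.

# Word -> most likely tag, hand-written for this one sentence.
LOOKUP = {
    "a": "DT",
    "woman": "NN",
    "man": "NN",
    "fish": "NN",
    "bicycle": "NN",
}


def backoff_guess(tokens, index):
    """Toy replacement for a statistical tagger; only sees unknown words."""
    word = tokens[index]
    previous = tokens[index - 1] if index > 0 else None
    # "needs" directly after a word the table calls a noun is read as a verb.
    if word == "needs" and LOOKUP.get(previous) == "NN":
        return "VBZ"
    if word == "like":
        return "IN"
    return "NN"


def tag(tokens):
    """Tag each token from the table, falling back to backoff_guess."""
    tagged = []
    for index, word in enumerate(tokens):
        word_tag = LOOKUP.get(word)
        if word_tag is None:
            word_tag = backoff_guess(tokens, index)
        tagged.append((word, word_tag))
    return tagged


tagged = tag("a woman needs a man like a fish needs a bicycle".split())
print(tagged)
```

Because fish is resolved by the table before the backoff is consulted, it can never come out as "JJ", whatever the backoff would have guessed.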




Answer 3:


You expect "a woman needs a man like a fish needs a bicycle" to get POS tags that correspond to this "parse":

[ [[a woman] needs [a man]] like [[a fish] needs [a bicycle]] ]

Instead, the default NLTK POS tagger is not smart enough and gives you POS tags that correspond to this parse:

[ [[a woman] needs [a man]] like [a fish needs] [a bicycle] ]




Answer 4:


It depends on the input the POS tagger is given. Take the sentence "a woman needs a man like a fish needs a bicycle".

The tags differ depending on whether you tokenize it with NLTK's default word tokenizer or with a regex tokenizer.

import nltk 
from nltk.tokenize import RegexpTokenizer

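# Raw string, so the backslashes reach the regex engine unchanged.
# Note that \W+ (uppercase W) also matches runs of whitespace.
TOKENIZER = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')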

s = "a woman needs a man like a fish needs a bicycle"

regex_tokenize = TOKENIZER.tokenize(s)
default_tokenize = nltk.word_tokenize(s)

regex_tag = nltk.pos_tag(regex_tokenize)
default_tag = nltk.pos_tag(default_tokenize)

print(regex_tag)
print("\n")
print(default_tag)

The output is as follows:

  Regex Tokenizer: 

[('a', 'DT'), (' ', 'NN'), ('woman', 'NN'), (' ', ':'), ('needs', 'NNS'), (' ', 'VBP'), ('a', 'DT'), (' ', 'NN'), ('man', 'NN'), (' ', ':'), ('like', 'IN'), (' ', 'NN'), ('a', 'DT'), (' ', 'NN'), ('fish', 'NN'), (' ', ':'), ('needs', 'VBZ'), (' ', ':'), ('a', 'DT'), (' ', 'NN'), ('bicycle', 'NN')]

 Default Tokenizer: 

[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]

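With the regex tokenizer, fish is tagged as a noun, while with the default tokenizer it is tagged as an adjective. Note that this regex pattern also emits every space as a token (the ' ' entries in the first output), so the tagger is given a different token sequence, and the tags it assigns change with the input it sees.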
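The whitespace tokens can be reproduced with the standard `re` module alone. This is a sketch under one assumption: that `RegexpTokenizer` in its default mode simply returns the pattern's matches, which is what the output above suggests. The second pattern, with a lowercase `\w+`, is a guess at what was intended and is not part of the original answer.

```python
import re

s = "a woman needs a man like a fish needs a bicycle"

# Pattern from the answer: \W+ matches each space, so spaces become tokens.
with_spaces = re.findall(r"(?u)\W+|\$[\d\.]+|\S+", s)

# Assumed correction: \w+ matches words only, so spaces are skipped.
words_only = re.findall(r"(?u)\w+|\$[\d\.]+|\S+", s)

print(with_spaces[:4])          # ['a', ' ', 'woman', ' ']
print(len(with_spaces))         # 21 tokens: 11 words plus 10 spaces
print(words_only == s.split())  # True
```

With the lowercase pattern the token list matches `s.split()`, so the tagger would receive the same input as in the question.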




Answer 5:


If you use the Stanford POS tagger (3.5.1) then the phrase is correctly tagged:

# In later NLTK releases this class is named StanfordPOSTagger.
from nltk.tag.stanford import POSTagger
st = POSTagger("/.../stanford-postagger-full-2015-01-30/models/english-left3words-distsim.tagger",
               "/.../stanford-postagger-full-2015-01-30/stanford-postagger.jar")
st.tag("a woman needs a man like a fish needs a bicycle".split())

yields:

[('a', 'DT'),
 ('woman', 'NN'),
 ('needs', 'VBZ'),
 ('a', 'DT'),
 ('man', 'NN'),
 ('like', 'IN'),
 ('a', 'DT'),
 ('fish', 'NN'),
 ('needs', 'VBZ'),
 ('a', 'DT'),
 ('bicycle', 'NN')]
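To see exactly where the two taggers disagree, the two outputs quoted in this thread can be compared token by token. This is only a sketch over the pasted results; the variable names are made up for illustration and nothing is re-run through either tagger.

```python
# Output of nltk.pos_tag as quoted in the question.
default_tags = [
    ("a", "DT"), ("woman", "NN"), ("needs", "VBZ"), ("a", "DT"),
    ("man", "NN"), ("like", "IN"), ("a", "DT"), ("fish", "JJ"),
    ("needs", "NNS"), ("a", "DT"), ("bicycle", "NN"),
]

# Output of the Stanford tagger as quoted in this answer.
stanford_tags = [
    ("a", "DT"), ("woman", "NN"), ("needs", "VBZ"), ("a", "DT"),
    ("man", "NN"), ("like", "IN"), ("a", "DT"), ("fish", "NN"),
    ("needs", "VBZ"), ("a", "DT"), ("bicycle", "NN"),
]

# Keep (word, default tag, Stanford tag) wherever the two tags differ.
differences = [
    (word, default_tag, stanford_tag)
    for (word, default_tag), (_, stanford_tag) in zip(default_tags, stanford_tags)
    if default_tag != stanford_tag
]

print(differences)  # [('fish', 'JJ', 'NN'), ('needs', 'NNS', 'VBZ')]
```

The only disagreements are fish and the second needs, the two words that the default tagger read as an adjective-noun pair.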


Source: https://stackoverflow.com/questions/13529945/pos-tagging-nltk-thinks-noun-is-adjective
