Question
In the following code, why does nltk think 'fish' is an adjective and not a noun?
>>> import nltk
>>> s = "a woman needs a man like a fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
Answer 1:
I am not sure what the workaround is, but you can check the source here: https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/
Meanwhile, I tried your sentence with a slightly different approach.
>>> s = "a woman needs a man. A fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man.', 'NP'), ('A', 'NNP'), ('fish', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('bicycle', 'NN')]
This time 'fish' was tagged as "NN".
Answer 2:
If you first used a lookup tagger as described in chapter 5 of the NLTK book (for example, with WordNet as the lookup reference), your tagger would already "know" that fish cannot be an adjective. For words with several possible POS tags, you could then use a statistical tagger as a backoff tagger.
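A minimal sketch of the lookup-then-backoff idea, in plain Python rather than NLTK. The `LOOKUP` table, `canned_statistical_tagger`, and `tag_with_lookup` are hypothetical names made up for illustration; the "statistical" tagger simply replays the output from the question. In NLTK itself, `nltk.UnigramTagger` accepts a `model` dictionary and a `backoff` tagger, which is roughly the arrangement this imitates.

```python
# Toy sketch: prefer a lookup table, fall back to another tagger otherwise.
# LOOKUP and canned_statistical_tagger are hypothetical stand-ins,
# not NLTK or WordNet data.

# Hand-written entries for words treated as unambiguous in this example.
LOOKUP = {"a": "DT", "woman": "NN", "man": "NN", "fish": "NN", "bicycle": "NN"}

# Tags copied from the nltk.pos_tag output shown in the question.
QUESTION_OUTPUT = [
    ("a", "DT"), ("woman", "NN"), ("needs", "VBZ"), ("a", "DT"),
    ("man", "NN"), ("like", "IN"), ("a", "DT"), ("fish", "JJ"),
    ("needs", "NNS"), ("a", "DT"), ("bicycle", "NN"),
]


def canned_statistical_tagger(tokens):
    """Stand-in for a statistical tagger: replays the question's output."""
    assert [word for word, _ in QUESTION_OUTPUT] == list(tokens)
    return list(QUESTION_OUTPUT)


def tag_with_lookup(tokens, lookup, backoff):
    """Use the lookup tag when a word is listed, else the backoff's tag."""
    return [(word, lookup.get(word, tag)) for word, tag in backoff(tokens)]


tokens = "a woman needs a man like a fish needs a bicycle".split()
tagged = tag_with_lookup(tokens, LOOKUP, canned_statistical_tagger)
print(tagged)
```

In this sketch 'fish' comes out as NN because the lookup entry wins, while the second 'needs' keeps whatever the backoff says, since ambiguous words are deliberately left out of the table.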
Answer 3:
It's because you want the sentence "a woman needs a man like a fish needs a bicycle"
to get POS tags that correspond to this "parse":
[ [[a woman] needs [a man]] like [[a fish] needs [a bicycle]] ]
Instead, the default NLTK POS tagger isn't smart enough and gives you POS tags that correspond to this parse:
[ [[a woman] needs [a man]] like [a fish needs] [a bicycle] ]
Answer 4:
It depends on how the input is given to the POS tagger. For example, take the sentence: "a woman needs a man like a fish needs a bicycle"
If you tokenize it with the default NLTK word tokenizer and with a regex tokenizer, the resulting tags are different.
import nltk
from nltk.tokenize import RegexpTokenizer
# Raw string so the backslashes reach the regex engine unchanged
TOKENIZER = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')
s = "a woman needs a man like a fish needs a bicycle"
# Tokenize the same sentence in two different ways
regex_tokenize = TOKENIZER.tokenize(s)
default_tokenize = nltk.word_tokenize(s)
# Tag each token list with the default tagger
regex_tag = nltk.pos_tag(regex_tokenize)
default_tag = nltk.pos_tag(default_tokenize)
print(regex_tag)
print("\n")
print(default_tag)
The output is as follows:
Regex Tokenizer:
[('a', 'DT'), (' ', 'NN'), ('woman', 'NN'), (' ', ':'), ('needs', 'NNS'), (' ', 'VBP'), ('a', 'DT'), (' ', 'NN'), ('man', 'NN'), (' ', ':'), ('like', 'IN'), (' ', 'NN'), ('a', 'DT'), (' ', 'NN'), ('fish', 'NN'), (' ', ':'), ('needs', 'VBZ'), (' ', ':'), ('a', 'DT'), (' ', 'NN'), ('bicycle', 'NN')]
Default Tokenizer:
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
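With the regex tokenizer, fish is tagged as a noun, while with the default tokenizer it is tagged as an adjective. Note that the regex tokenizer also emits the spaces between words as tokens, so the tagger sees a different token sequence. Depending on the tokenizer used, the tagger gets different context and assigns different tags.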
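To see why the two token lists differ, the pattern can be tried with the standard `re` module alone. This is only a sketch: it assumes that `RegexpTokenizer`, in its default mode, returns the same matches as `re.findall` for this pattern, and it uses `str.split` in place of `nltk.word_tokenize` for the comparison.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = r'(?u)\W+|\$[\d\.]+|\S+'

s = "a woman needs a man like a fish needs a bicycle"

# \W+ is the first alternative, so each run of spaces becomes its own token.
regex_tokens = re.findall(PATTERN, s)
# Plain whitespace splitting keeps only the words.
split_tokens = s.split()

print(len(regex_tokens), regex_tokens[:5])
print(len(split_tokens), split_tokens[:5])
```

The regex version yields 21 tokens (11 words plus the 10 spaces between them), which matches the number of tuples in the regex tokenizer output above.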
Answer 5:
If you use the Stanford POS tagger (3.5.1) then the phrase is correctly tagged:
from nltk.tag.stanford import POSTagger
st = POSTagger("/.../stanford-postagger-full-2015-01-30/models/english-left3words-distsim.tagger",
"/.../stanford-postagger-full-2015-01-30/stanford-postagger.jar")
st.tag("a woman needs a man like a fish needs a bicycle".split())
yields:
[('a', 'DT'),
('woman', 'NN'),
('needs', 'VBZ'),
('a', 'DT'),
('man', 'NN'),
('like', 'IN'),
('a', 'DT'),
('fish', 'NN'),
('needs', 'VBZ'),
('a', 'DT'),
('bicycle', 'NN')]
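For a quick comparison, the two outputs in this thread can be diffed position by position. The helper `tag_disagreements` below is a hypothetical name, and the tag lists are copied by hand from the question and from the Stanford output above, so no tagger is actually run here.

```python
# Output of nltk.pos_tag as shown in the question.
default_tags = [
    ("a", "DT"), ("woman", "NN"), ("needs", "VBZ"), ("a", "DT"),
    ("man", "NN"), ("like", "IN"), ("a", "DT"), ("fish", "JJ"),
    ("needs", "NNS"), ("a", "DT"), ("bicycle", "NN"),
]

# Output of the Stanford tagger as shown in this answer.
stanford_tags = [
    ("a", "DT"), ("woman", "NN"), ("needs", "VBZ"), ("a", "DT"),
    ("man", "NN"), ("like", "IN"), ("a", "DT"), ("fish", "NN"),
    ("needs", "VBZ"), ("a", "DT"), ("bicycle", "NN"),
]


def tag_disagreements(first, second):
    """Return (index, word, first_tag, second_tag) wherever the tags differ."""
    diffs = []
    for i, ((w1, t1), (w2, t2)) in enumerate(zip(first, second)):
        assert w1 == w2, "both taggers must have seen the same tokens"
        if t1 != t2:
            diffs.append((i, w1, t1, t2))
    return diffs


diffs = tag_disagreements(default_tags, stanford_tags)
print(diffs)
```

The two taggers disagree only on 'fish' and on the second 'needs', the same two words the question is about.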
Source: https://stackoverflow.com/questions/13529945/pos-tagging-nltk-thinks-noun-is-adjective