POS tagging - NLTK thinks noun is adjective

执笔经年 2020-12-19 11:56

In the following code, why does NLTK think 'fish' is an adjective and not a noun?

>>> import nltk
>>> s = "a woman needs a man like a fish needs a bicycle"

5 Answers
  • 2020-12-19 12:34

    You expect "a woman needs a man like a fish needs a bicycle" to get the POS tags that correspond to a "parse" like this:

    [ [[a woman] needs [a man]] like [[a fish] needs [a bicycle]] ]

    but the default NLTK POS tagger isn't smart enough, and instead gives you the POS tags for a parse like this:

    [ [[a woman] needs [a man]] like [a fish needs] [a bicycle] ]

  • 2020-12-19 12:41

    If you use the Stanford POS tagger (3.5.1) then the phrase is correctly tagged:

    from nltk.tag.stanford import POSTagger
    st = POSTagger("/.../stanford-postagger-full-2015-01-30/models/english-left3words-distsim.tagger",
                   "/.../stanford-postagger-full-2015-01-30/stanford-postagger.jar")
    st.tag("a woman needs a man like a fish needs a bicycle".split())
    

    yields:

    [('a', 'DT'),
     ('woman', 'NN'),
     ('needs', 'VBZ'),
     ('a', 'DT'),
     ('man', 'NN'),
     ('like', 'IN'),
     ('a', 'DT'),
     ('fish', 'NN'),
     ('needs', 'VBZ'),
     ('a', 'DT'),
     ('bicycle', 'NN')]
    
  • 2020-12-19 12:47

    I am not sure what the workaround is, but you can check the source here: https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/

    Meanwhile, I tried your sentence with a slightly different approach, splitting it into two sentences:

    >>> s = "a woman needs a man. A fish needs a bicycle"
    >>> nltk.pos_tag(s.split())
    [('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man.', 'NP'), ('A', 'NNP'), ('fish', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('bicycle', 'NN')]
    

    This time fish is tagged as "NN".

  • 2020-12-19 12:48

    If you first used a lookup tagger as described in chapter 5 of the NLTK book (for example, with WordNet as the lookup reference), your tagger would already "know" that fish cannot be an adjective. For words with several possible POS tags, you could then use a statistical tagger as a backoff tagger.
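
    The lookup-then-backoff idea can be sketched without any library. This is only an illustration under stated assumptions: the `LOOKUP` table, the `backoff_tag` heuristic, and the `lookup_tag` helper are all hypothetical names invented here, the table entries are hand-picked rather than taken from WordNet, and the suffix rule merely stands in for a real statistical tagger.

```python
# Hypothetical lookup table: words we choose to pin to a single tag.
# The entries are illustrative, not derived from WordNet or any corpus.
LOOKUP = {
    "a": "DT",
    "woman": "NN",
    "man": "NN",
    "like": "IN",
    "fish": "NN",
    "bicycle": "NN",
}


def backoff_tag(word):
    """Stand-in for a statistical backoff tagger: a crude suffix heuristic."""
    return "VBZ" if word.endswith("s") else "NN"


def lookup_tag(tokens, lookup=LOOKUP, backoff=backoff_tag):
    """Tag each token from the lookup table, deferring to `backoff` on a miss."""
    return [(tok, lookup.get(tok) or backoff(tok)) for tok in tokens]


tagged = lookup_tag("a woman needs a man like a fish needs a bicycle".split())
print(tagged)
```

    Here "fish" never reaches the backoff step, so it cannot come out as an adjective; only "needs", which is missing from the table, is decided by the fallback. In NLTK itself, the book's chapter 5 builds the same arrangement from a `UnigramTagger` given a `model` dictionary and a `backoff` tagger.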

  • 2020-12-19 12:55

    It depends on how the input is given to the POS tagger. Take the sentence "a woman needs a man like a fish needs a bicycle" as an example.

    The default NLTK word tokenizer and a regex tokenizer split it differently, so the resulting tags differ too.

    import nltk
    from nltk.tokenize import RegexpTokenizer

    # Raw string so the regex backslashes reach the regex engine unchanged
    TOKENIZER = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')

    s = "a woman needs a man like a fish needs a bicycle"

    regex_tokenize = TOKENIZER.tokenize(s)
    default_tokenize = nltk.word_tokenize(s)

    regex_tag = nltk.pos_tag(regex_tokenize)
    default_tag = nltk.pos_tag(default_tokenize)

    print(regex_tag)
    print("\n")
    print(default_tag)
    

    The output is as follows:

      Regex Tokenizer: 
    
    [('a', 'DT'), (' ', 'NN'), ('woman', 'NN'), (' ', ':'), ('needs', 'NNS'), (' ', 'VBP'), ('a', 'DT'), (' ', 'NN'), ('man', 'NN'), (' ', ':'), ('like', 'IN'), (' ', 'NN'), ('a', 'DT'), (' ', 'NN'), ('fish', 'NN'), (' ', ':'), ('needs', 'VBZ'), (' ', ':'), ('a', 'DT'), (' ', 'NN'), ('bicycle', 'NN')]
    
     Default Tokenizer: 
    
    [('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
    

    With the regex tokenizer, fish is tagged as a noun, while with the default tokenizer it is tagged as an adjective. Note that this regex tokenizer also emits each run of whitespace as a token of its own, so the tagger sees a different token sequence and assigns different tags.
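
    To see why the two token sequences differ, the pattern can be tried with the standard `re` module alone. This is a minimal sketch that assumes `RegexpTokenizer` in its default (non-gap) mode returns the same matches as `re.findall`; `str.split()` is used only as a rough stand-in for a words-only tokenization, not as a replacement for `nltk.word_tokenize`.

```python
import re

# Same pattern as in the answer above, written as a raw string.
PATTERN = re.compile(r"(?u)\W+|\$[\d\.]+|\S+")

s = "a woman needs a man like a fish needs a bicycle"

regex_tokens = PATTERN.findall(s)  # the \W+ branch keeps each whitespace run
plain_tokens = s.split()           # words only, whitespace discarded

print(len(regex_tokens), len(plain_tokens))  # 21 11
print(regex_tokens[:5])  # ['a', ' ', 'woman', ' ', 'needs']
```

    The ten extra tokens are the spaces between the eleven words, which matches the 21 tagged pairs in the regex tokenizer output above.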
