Lemmatizing POS tagged words with NLTK?

旧巷少年郎 2020-12-31 04:24

I have POS tagged some words with nltk.pos_tag(), so they are given treebank tags. I would like to lemmatize these words using the known POS tags, but I am not sure how. I w

2 Answers
  • 2020-12-31 04:56

    As @engineercoding pointed out in the comments on @rmalouf's answer, the Treebank tagset has many more tags than WordNet; see here for details.

    The following mapping covers as many bases as possible; it also explicitly lists the POS tags that have no match in WordNet:

    # Create a map between Treebank and WordNet 
    from nltk.corpus import wordnet as wn
    
    # WordNet POS tags are: NOUN = 'n', ADJ = 'a', VERB = 'v', ADV = 'r', ADJ_SAT = 's'
    # Descriptions (c) https://web.stanford.edu/~jurafsky/slp3/10.pdf
    tag_map = {
            'CC':None, # coordin. conjunction (and, but, or)  
            'CD':wn.NOUN, # cardinal number (one, two)             
            'DT':None, # determiner (a, the)                    
            'EX':wn.ADV, # existential ‘there’ (there)           
            'FW':None, # foreign word (mea culpa)             
            'IN':wn.ADV, # preposition/sub-conj (of, in, by)   
            'JJ':[wn.ADJ, wn.ADJ_SAT], # adjective (yellow)                  
            'JJR':[wn.ADJ, wn.ADJ_SAT], # adj., comparative (bigger)          
            'JJS':[wn.ADJ, wn.ADJ_SAT], # adj., superlative (wildest)           
            'LS':None, # list item marker (1, 2, One)          
            'MD':None, # modal (can, should)                    
            'NN':wn.NOUN, # noun, sing. or mass (llama)          
            'NNS':wn.NOUN, # noun, plural (llamas)                  
            'NNP':wn.NOUN, # proper noun, sing. (IBM)              
            'NNPS':wn.NOUN, # proper noun, plural (Carolinas)
            'PDT':[wn.ADJ, wn.ADJ_SAT], # predeterminer (all, both)            
            'POS':None, # possessive ending (’s )               
            'PRP':None, # personal pronoun (I, you, he)     
            'PRP$':None, # possessive pronoun (your, one’s)    
            'RB':wn.ADV, # adverb (quickly, never)            
            'RBR':wn.ADV, # adverb, comparative (faster)        
            'RBS':wn.ADV, # adverb, superlative (fastest)     
            'RP':[wn.ADJ, wn.ADJ_SAT], # particle (up, off)
            'SYM':None, # symbol (+,%, &)
            'TO':None, # “to” (to)
            'UH':None, # interjection (ah, oops)
            'VB':wn.VERB, # verb base form (eat)
            'VBD':wn.VERB, # verb past tense (ate)
            'VBG':wn.VERB, # verb gerund (eating)
            'VBN':wn.VERB, # verb past participle (eaten)
            'VBP':wn.VERB, # verb non-3sg pres (eat)
            'VBZ':wn.VERB, # verb 3sg pres (eats)
            'WDT':None, # wh-determiner (which, that)
            'WP':None, # wh-pronoun (what, who)
            'WP$':None, # possessive (wh- whose)
            'WRB':None, # wh-adverb (how, where)
            '$':None, #  dollar sign ($)
            '#':None, # pound sign (#)
            '``':None, # left quote (‘ or “)
            "''":None, # right quote (’ or ”)
            '(':None, # left parenthesis ([, (, {, <)
            ')':None, # right parenthesis (], ), }, >)
            ',':None, # comma (,)
            '.':None, # sentence-final punc (. ! ?)
            ':':None # mid-sentence punc (: ; ... – -)
        }
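
A mapping like this holds three kinds of values: None, a single POS, or a list of POS tags. One way to consume it is sketched below. The helper names, the `lemmatize` callable, and the small `demo_map` are illustrative assumptions, not part of the original answer; `demo_map` uses the plain strings 'n', 'a' and 's' that `wn.NOUN`, `wn.ADJ` and `wn.ADJ_SAT` stand for.

```python
def wordnet_pos_candidates(tag_map, penn_tag):
    """Return the WordNet POS candidates for a Treebank tag as a list (possibly empty)."""
    value = tag_map.get(penn_tag)  # unknown tags behave like tags mapped to None
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def lemmatize_with_map(word, penn_tag, tag_map, lemmatize):
    """Try each candidate POS in order and return the first lemma that differs
    from the input word; fall back to the word itself."""
    for pos in wordnet_pos_candidates(tag_map, penn_tag):
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word


# Hypothetical excerpt of the mapping, written with plain string values.
demo_map = {'NNS': 'n', 'JJR': ['a', 's'], 'DT': None}
```

With NLTK and the WordNet corpus installed, passing `WordNetLemmatizer().lemmatize` (from `nltk.stem`) as `lemmatize`, together with the full `tag_map` above, should work the same way. Words whose tag maps to None are returned unchanged.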
    
  • The WordNet lemmatizer only knows four parts of speech (ADJ, ADV, NOUN, and VERB), and only the NOUN and VERB rules do anything especially interesting. The noun tags in the Treebank tagset all start with NN, the verb tags all start with VB, the adjective tags start with JJ, and the adverb tags start with RB. So converting from one set of labels to the other is pretty easy, something like:

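    from nltk.corpus import wordnet
    
    penn_tag = 'VBD'  # example value: a Treebank tag as returned by nltk.pos_tag()
    # Look up the WordNet POS by the first two characters of the Treebank tag.
    # Tags outside these four families (e.g. 'DT') raise KeyError.
    morphy_tag = {'NN':wordnet.NOUN,'JJ':wordnet.ADJ,'VB':wordnet.VERB,'RB':wordnet.ADV}[penn_tag[:2]]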
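
The prefix lookup can be wrapped so that tags outside the four families do not raise, and then applied to a whole tagged sentence. This is a minimal sketch under stated assumptions: the function names and the `lemmatize` parameter are hypothetical, and the POS constants are written as the plain strings that `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV` evaluate to.

```python
# Plain-string equivalents of wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV.
NOUN, VERB, ADJ, ADV = 'n', 'v', 'a', 'r'

PREFIX_MAP = {'NN': NOUN, 'VB': VERB, 'JJ': ADJ, 'RB': ADV}


def penn_to_wordnet(penn_tag, default=None):
    """Map a Treebank tag to a WordNet POS by its two-letter prefix.
    Tags outside the four families yield `default` instead of raising."""
    return PREFIX_MAP.get(penn_tag[:2], default)


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize (word, penn_tag) pairs; words without a WordNet POS pass through."""
    lemmas = []
    for word, tag in tagged_words:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With NLTK installed, a call such as `lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize)` would feed the tagger output straight into the WordNet lemmatizer. Returning the word unchanged for unmapped tags is a design choice here; defaulting to NOUN is another common option.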
    