Lemmatizing POS tagged words with NLTK?

旧巷少年郎 2020-12-31 04:24

I have POS tagged some words with nltk.pos_tag(), so they are given treebank tags. I would like to lemmatize these words using the known POS tags, but I am not sure how. I w

2 answers
  •  予麋鹿 (original poster)
    2020-12-31 04:56

    As @engineercoding pointed out in the comments to @rmalouf's answer, the Treebank tag set contains many more tags than WordNet has part-of-speech categories; see here for details.

    The following mapping covers as many cases as possible. It also explicitly lists the POS tags that have no match in WordNet by mapping them to None:

    # Create a map between Treebank and WordNet 
    from nltk.corpus import wordnet as wn
    
    # WordNet POS tags are: NOUN = 'n', ADJ = 'a', VERB = 'v', ADV = 'r', ADJ_SAT = 's'
    # Descriptions (c) https://web.stanford.edu/~jurafsky/slp3/10.pdf
    tag_map = {
            'CC':None, # coordin. conjunction (and, but, or)  
            'CD':wn.NOUN, # cardinal number (one, two)             
            'DT':None, # determiner (a, the)                    
            'EX':wn.ADV, # existential ‘there’ (there)           
            'FW':None, # foreign word (mea culpa)             
            'IN':wn.ADV, # preposition/sub-conj (of, in, by)   
            'JJ':[wn.ADJ, wn.ADJ_SAT], # adjective (yellow)                  
            'JJR':[wn.ADJ, wn.ADJ_SAT], # adj., comparative (bigger)          
            'JJS':[wn.ADJ, wn.ADJ_SAT], # adj., superlative (wildest)           
            'LS':None, # list item marker (1, 2, One)          
            'MD':None, # modal (can, should)                    
            'NN':wn.NOUN, # noun, sing. or mass (llama)          
            'NNS':wn.NOUN, # noun, plural (llamas)                  
            'NNP':wn.NOUN, # proper noun, sing. (IBM)              
            'NNPS':wn.NOUN, # proper noun, plural (Carolinas)
            'PDT':[wn.ADJ, wn.ADJ_SAT], # predeterminer (all, both)            
            'POS':None, # possessive ending (’s )               
            'PRP':None, # personal pronoun (I, you, he)     
            'PRP$':None, # possessive pronoun (your, one’s)    
            'RB':wn.ADV, # adverb (quickly, never)            
            'RBR':wn.ADV, # adverb, comparative (faster)        
            'RBS':wn.ADV, # adverb, superlative (fastest)     
            'RP':[wn.ADJ, wn.ADJ_SAT], # particle (up, off)
            'SYM':None, # symbol (+,%, &)
            'TO':None, # “to” (to)
            'UH':None, # interjection (ah, oops)
            'VB':wn.VERB, # verb base form (eat)
            'VBD':wn.VERB, # verb past tense (ate)
            'VBG':wn.VERB, # verb gerund (eating)
            'VBN':wn.VERB, # verb past participle (eaten)
            'VBP':wn.VERB, # verb non-3sg pres (eat)
            'VBZ':wn.VERB, # verb 3sg pres (eats)
            'WDT':None, # wh-determiner (which, that)
            'WP':None, # wh-pronoun (what, who)
            'WP$':None, # possessive (wh- whose)
            'WRB':None, # wh-adverb (how, where)
            '$':None, #  dollar sign ($)
            '#':None, # pound sign (#)
            '``':None, # left quote (‘ or “); nltk.pos_tag() emits the tag as two backticks
            "''":None, # right quote (’ or ”); nltk.pos_tag() emits the tag as two apostrophes
            '(':None, # left parenthesis ([, (, {, <)
            ')':None, # right parenthesis (], ), }, >)
            ',':None, # comma (,)
            '.':None, # sentence-final punc (. ! ?)
            ':':None # mid-sentence punc (: ; ... – -)
        }
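
To connect the mapping back to the original question, here is a minimal sketch of how it could be used for lemmatizing. It is not part of the original answer. The helper names `candidate_pos` and `lemmatize_tagged` are hypothetical, and the rule for tags that map to a list (try each WordNet POS in order and keep the first lemma that differs from the word) is an assumption rather than anything NLTK prescribes. Tags mapped to None, or missing from the map, leave the word unchanged.

```python
def candidate_pos(treebank_tag, tag_map):
    """Return the WordNet POS candidates for a Treebank tag as a list."""
    pos = tag_map.get(treebank_tag)
    if pos is None:
        # Tag is unknown or has no WordNet equivalent.
        return []
    if isinstance(pos, (list, tuple)):
        return list(pos)
    return [pos]


def lemmatize_tagged(tagged_words, tag_map, lemmatize=None):
    """Lemmatize (word, treebank_tag) pairs using a Treebank-to-WordNet map.

    `lemmatize` is any callable taking (word, pos). If omitted, NLTK's
    WordNetLemmatizer is used, which needs the 'wordnet' corpus installed.
    """
    if lemmatize is None:
        from nltk.stem import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize

    lemmas = []
    for word, tag in tagged_words:
        lemma = word
        # Assumption: take the first candidate POS that changes the word.
        for pos in candidate_pos(tag, tag_map):
            candidate = lemmatize(word, pos)
            if candidate != word:
                lemma = candidate
                break
        lemmas.append(lemma)
    return lemmas
```

With the `tag_map` defined above, a call would look like `lemmatize_tagged(nltk.pos_tag(tokens), tag_map)`, where `tokens` is a list of word strings. Passing the lemmatizer in as a callable keeps the helper easy to test without the WordNet data.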
    
