NLTK Named Entity recognition to a Python list

再見小時候 2020-11-28 08:14

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police of


        
7 answers
  •  挽巷 (original poster)
    2020-11-28 08:50

    Use tree2conlltags from nltk.chunk. Note that ne_chunk expects POS-tagged input, and pos_tag in turn works on word tokens, so the text has to go through word_tokenize first.

    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk.chunk import tree2conlltags
    
    sentence = "Mark and John are working at Google."
    # tokenize -> POS-tag -> chunk named entities -> flatten the tree to IOB triples
    print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
    """[('Mark', 'NNP', 'B-PERSON'), 
        ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), 
        ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), 
        ('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), 
        ('.', '.', 'O')] """
    

    This will give you a list of tuples: [(token, pos_tag, named_entity_tag)]. If this list is not exactly what you want, it is certainly easier to parse the list you want out of this list than out of an nltk tree.

    Code and details from this link; check it out for more information
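    As one way of parsing the (token, pos_tag, named_entity_tag) list into a plain Python list of entities, here is a minimal sketch that joins consecutive B-/I- tokens into one string per entity. The helper name group_entities is hypothetical (it is not part of NLTK), and the input is the tagged output shown above, typed in by hand so the snippet runs without any NLTK models.

```python
def group_entities(conll_tags):
    """Collapse (token, pos_tag, iob_tag) triples into (entity_text, label) pairs.

    Hypothetical helper, not an NLTK function. Assumes IOB tags of the
    form 'B-LABEL', 'I-LABEL' or 'O', as produced by tree2conlltags.
    """
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, _pos, iob in conll_tags:
        prefix, _, label = iob.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here (also covers a stray I- tag).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O': outside any entity, so close whatever was open.
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


# Tagged output copied from the answer above.
tagged = [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'),
          ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'),
          ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
          ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

entities = group_entities(tagged)
print(entities)
# [('Mark', 'PERSON'), ('John', 'PERSON'), ('Google', 'ORGANIZATION')]
```

    Keeping the label next to the text makes it easy to filter afterwards, for example [text for text, label in entities if label == 'PERSON'].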

    You can also continue by only extracting the words, with the following function:

    def wordextractor(tuple1):
        # tuple1 is a list of (token, pos_tag, named_entity_tag) triples
        c = []
        for word, _pos_tag, ne_tag in tuple1:
            # keep the words tagged B-PERSON or I-PERSON
            if ne_tag in ('B-PERSON', 'I-PERSON'):
                c.append(word)
        return c

    # reuses the names imported above (word_tokenize, pos_tag, ne_chunk, tree2conlltags)
    print(wordextractor(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence))))))
    

    Edit: Added output docstring. Edit: Added output only for B-PERSON.
