Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format

不思量自难忘° 2020-12-15 10:29

I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce the results like this:

[(u'Remaking', u'O'), (u'


        
4 answers
  •  北海茫月
    2020-12-15 11:14

    It looks long but it does the work:

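    ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
    # prev_tag remembers the tag of the previous token; it starts empty so
    # the first token can never match it.
    chunked, prev_tag = [], ""
    for word_pos in ner_output:
        word, pos = word_pos
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            # Same entity tag as the previous token: extend the last tuple.
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos

    # Tuples longer than 2 are merged entities: join the words, keep the last tag.
    clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]]) if len(wordpos) != 2 else wordpos for wordpos in chunked]

    print(clean_chunked)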
    

    [out]:

    [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]
    

    For more details:

    The first for-loop "with memory" achieves something like this:

    [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]
    

    You'll notice that every multi-word named entity now has more than 2 items in its tuple. What you want are the words, i.e. 'Republican Party' from (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'), and they sit at the even indices, so you can slice them out like this:

    >>> x = [0,1,2,3,4,5,6]
    >>> x[::2]
    [0, 2, 4, 6]
    >>> x[1::2]
    [1, 3, 5]
    

    The last element in the NE tuple is the tag you want, so you would do:

    >>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
    >>> x[::2]
    (u'Republican', u'Party')
    >>> x[-1]
    u'ORGANIZATION'
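
The two slices above can be combined into one small helper. This is only a sketch: the name `merge_entity_tuple` is made up for illustration and is not part of NLTK or the Stanford tagger.

```python
def merge_entity_tuple(flat):
    """Collapse (word, tag, word, tag, ...) into (joined words, tag)."""
    words = flat[::2]   # even positions hold the words
    tag = flat[-1]      # the last position holds the shared tag
    return (" ".join(words), tag)


merged = merge_entity_tuple(("Republican", "ORGANIZATION", "Party", "ORGANIZATION"))
print(merged)
```

A plain 2-item tuple such as ("The", "O") passes through with the same contents, so the helper could be applied to every item in the chunked list without a length check.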
    

    It's a little ad-hoc and verbose, but I hope it helps. And here it is in a function, Blessed Christmas:

    ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]


    def rechunk(ner_output):
        # prev_tag remembers the tag of the previous token; it starts empty
        # so the first token can never match it.
        chunked, prev_tag = [], ""
        for word_pos in ner_output:
            word, pos = word_pos
            if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
                # Same entity tag as the previous token: extend the last tuple.
                chunked[-1] += word_pos
            else:
                chunked.append(word_pos)
            prev_tag = pos

        # Tuples longer than 2 are merged entities: join the words, keep the last tag.
        clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]])
                         if len(wordpos) != 2 else wordpos for wordpos in chunked]

        return clean_chunked


    print(rechunk(ner_output))
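
As an alternative to the hand-written loop, the same grouping of consecutive tokens by tag could be sketched with `itertools.groupby` from the standard library. This is a different technique from the answer's loop, not a description of it, and the names `rechunk_groupby` and `ENTITY_TAGS` are made up for illustration.

```python
from itertools import groupby

# Tags whose consecutive tokens should be merged into one chunk.
ENTITY_TAGS = {"PERSON", "ORGANIZATION", "LOCATION"}


def rechunk_groupby(ner_output):
    """Merge runs of consecutive tokens that share the same entity tag."""
    chunked = []
    # groupby yields one group per run of adjacent pairs with an equal tag.
    for tag, group in groupby(ner_output, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag in ENTITY_TAGS:
            chunked.append((" ".join(words), tag))
        else:
            # Non-entity tokens (e.g. tag 'O') stay as separate items.
            chunked.extend((word, tag) for word in words)
    return chunked


sample = [("Remaking", "O"), ("The", "O"),
          ("Republican", "ORGANIZATION"), ("Party", "ORGANIZATION")]
result = rechunk_groupby(sample)
print(result)
```

Like the loop version, this merges any adjacent tokens with the same tag, so two different entities that happen to sit next to each other with the same tag end up in one chunk.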
    
