Named Entity Recognition with Regular Expression: NLTK

-上瘾入骨i 2020-12-16 19:27

I have been playing with the NLTK toolkit. I come across this problem a lot and have searched for a solution online, but I have not found a satisfying answer anywhere, so I am putting my query here.

3 Answers

谎友^ (original poster)

2020-12-16 20:22

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then run NLTK's named-entity chunker.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # An NE subtree: collect its tokens into the current run.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-NE token ends the run: flush it as one entity.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was already seen.
            current_chunk = []

    # Flush a run that reaches the end of the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person." 
print(get_continuous_chunks(txt))

[out]:

['Barack Obama']

Note, however, that if contiguous chunks are not supposed to form a single NE, this function will merge multiple NEs into one. I can't think of an example off the top of my head, but I'm sure it happens. If the NEs are not contiguous, the script above works fine:

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
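
One possible way to reduce the merging problem mentioned above is to join adjacent chunks only when they carry the same entity label. The sketch below is an illustration under assumptions, not part of the original answer: `merge_same_label` and `is_chunk` are hypothetical names, and the `chunked` list is hand-written stand-in data (plain tuples) rather than real `ne_chunk` output, so it runs without NLTK or its models. With a real NLTK tree, `subtree.label()` and `subtree.leaves()` would supply the same information.

```python
from itertools import groupby

# Stand-in for ne_chunk output (hypothetical data, not real NLTK output):
# an entity chunk is (label, [(token, pos), ...]); a plain token is (token, pos).
chunked = [
    ("In", "IN"),
    ("GPE", [("Washington", "NNP")]),
    ("PERSON", [("Obama", "NNP")]),
    ("met", "VBD"),
    ("PERSON", [("Michelle", "NNP")]),
    ("PERSON", [("Obama", "NNP")]),
    (".", "."),
]


def is_chunk(node):
    # In this stand-in format, an entity chunk has a list of leaves
    # in second position; a plain token has a POS string there.
    return isinstance(node[1], list)


def merge_same_label(chunked):
    """Merge adjacent entity chunks only when they share a label."""
    entities = []
    # The key is the label for chunks and None for plain tokens, so groupby
    # starts a new group whenever the label changes or a token intervenes.
    for label, group in groupby(chunked, key=lambda n: n[0] if is_chunk(n) else None):
        if label is None:
            continue  # skip runs of ordinary tokens
        tokens = [tok for _, leaves in group for tok, _ in leaves]
        entity = (" ".join(tokens), label)
        if entity not in entities:
            entities.append(entity)
    return entities


print(merge_same_label(chunked))
```

Here the adjacent GPE and PERSON chunks stay separate, while the two adjacent PERSON chunks are still joined. Two distinct entities of the same type that happen to be adjacent would still be merged, so this narrows the problem rather than removing it.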
