Extract city names from text using python

温柔的废话 2020-12-18 16:25

I have a dataset in which one column is titled "What is your location and time zone?"

As a result, the entries are free text, such as:

  1. Denmark, CET
  2. <
1 answer
  •  离开以前
    2020-12-18 16:38

    I would use natural language processing, specifically the named-entity recognition that nltk offers, to extract the entities.

    The example below (heavily based on a gist) tokenizes each line of a file, tags each token with its part of speech, chunks the tagged tokens, and recursively collects every chunk labeled NE (named entity):

    import nltk

    # Requires the NLTK tokenizer, POS tagger and NE chunker data (see nltk.download()).

    def extract_entity_names(t):
        """Recursively collect the text of every subtree labeled 'NE'."""
        entity_names = []

        # Leaves are (word, tag) tuples and have no label(); only trees do.
        if hasattr(t, 'label'):
            if t.label() == 'NE':
                # Join the words of the chunk, e.g. 'South' + 'Korea'.
                entity_names.append(' '.join(child[0] for child in t))
            else:
                for child in t:
                    entity_names.extend(extract_entity_names(child))

        return entity_names

    with open('sample.txt', 'r') as f:
        for line in f:
            sentences = nltk.sent_tokenize(line)
            tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
            tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
            # binary=True marks every entity as 'NE' instead of a specific type.
            chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

            entities = []
            for tree in chunked_sentences:
                entities.extend(extract_entity_names(tree))

            print(entities)
    

    For the sample.txt containing:

    Denmark, CET
    Location is Devon, England, GMT time zone
    Australia. Australian Eastern Standard Time. +10h UTC.
    My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
    For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)
    

    It prints:

    ['Denmark', 'CET']
    ['Location', 'Devon', 'England', 'GMT']
    ['Australia', 'Australian Eastern Standard Time']
    ['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
    ['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']
    

    The output is not ideal (it also picks up time zone names and stray words such as 'Location'), but it might be a good starting point for you.
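
    One way to tidy the lists printed above is a small post-filter. This is only a rough sketch: the function name keep_place_candidates, the time zone abbreviation set, and the noise-word set are all hypothetical and far from complete, and it still cannot tell a city from a country (that would need a gazetteer or a finer-grained entity type).

```python
# Hypothetical post-filter for the entity lists shown above.
# Both sets are illustrative assumptions, not exhaustive lists.
TIME_ZONE_ABBREVIATIONS = {"CET", "GMT", "UTC", "EDT", "EST", "PST", "PDT"}
NOISE_WORDS = {"Location", "Pacific"}


def keep_place_candidates(entities):
    """Drop time zone labels and known noise, keeping first occurrences in order."""
    seen = set()
    places = []
    for name in entities:
        if name in TIME_ZONE_ABBREVIATIONS or name in NOISE_WORDS:
            continue
        # Skip multi-word zone names such as 'Australian Eastern Standard Time'.
        if "Time" in name.split():
            continue
        if name not in seen:
            seen.add(name)
            places.append(name)
    return places


print(keep_place_candidates(['Location', 'Devon', 'England', 'GMT']))
# prints ['Devon', 'England']
```

    Calling it on each `entities` list before printing would remove the repeated 'London' and the zone labels, at the cost of maintaining the two sets by hand.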
