How can I split a text into sentences?

傲寒 2020-11-22 06:33

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

13 Answers
  •  深忆病人
    2020-11-22 06:56

    Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example '.' vs. '."'.

    abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                     'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
    terminators = ['.', '!', '?']
    wrappers = ['"', "'", ')', ']', '}']
    
    
    def find_sentences(paragraph):
        # Peel sentences off the end of the paragraph until no break remains.
        end = 0
        sentences = []
        while end > -1:
            end = find_sentence_end(paragraph)
            if end > -1:
                sentences.append(paragraph[end:].strip())
                paragraph = paragraph[:end]
        sentences.append(paragraph)
        sentences.reverse()
        return sentences
    
    
    def find_sentence_end(paragraph):
        # Return the index just past the last sentence break, or -1 if there is none.
        possible_endings, contraction_locations = [], []
        # Bare terminators plus each terminator followed by a closing wrapper, e.g. '."'
        sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
        for sentence_terminator in sentence_terminators:
            possible_endings.extend([i, len(sentence_terminator)] for i in find_all(paragraph, sentence_terminator))
        # The abbreviation keys are lowercase, so search a lowercased copy of the text.
        lowered = paragraph.lower()
        for contraction in abbreviations:
            # Only count matches at the start of a word, so 'ms.' does not match inside 'items.'
            contraction_locations.extend(i + len(contraction) for i in find_all(lowered, contraction)
                                         if i == 0 or not lowered[i - 1].isalnum())
        # Drop terminators that are really the final dot of an abbreviation.
        possible_endings = [pe for pe in possible_endings if sum(pe) not in contraction_locations]
        # Ignore the terminator that closes the paragraph itself.
        if len(paragraph) in [sum(pe) for pe in possible_endings]:
            max_end_start = max(pe[0] for pe in possible_endings)
            possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
        # Keep only endings that are followed by a space.
        possible_endings = [sum(pe) for pe in possible_endings if sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ']
        return max(possible_endings) if possible_endings else -1
    
    
    def find_all(a_str, sub):
        start = 0
        while True:
            start = a_str.find(sub, start)
            if start == -1:
                return
            yield start
            start += len(sub)
    

    I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
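
For comparison, here is a minimal sketch of the same idea (a terminator, optionally followed by a closing wrapper, ends a sentence unless it belongs to a known abbreviation) written with the standard `re` module instead of the index scanning above. It is a regex-based variant, not the code from this answer: the names `split_sentences`, `ABBREVIATIONS`, and `BREAK` are hypothetical, and the abbreviation set is only illustrative.

```python
import re

# Illustrative abbreviation set (an assumption); extend it for your own text.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "jr.", "sr.", "e.g.", "i.e.", "vs."}

# A terminator, an optional closing wrapper, then at least one whitespace character.
BREAK = re.compile(r"""[.!?]["')\]}]?\s+""")


def split_sentences(text):
    sentences, start = [], 0
    for match in BREAK.finditer(text):
        # Last whitespace-delimited word before the candidate break,
        # with surrounding quotes and brackets stripped.
        words = text[start:match.end()].split()
        last_word = words[-1].strip("\"'()[]{}").lower()
        if last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever follows the last break is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences('Dr. Smith arrived. He said "Hi." Then he left!'))
```

Like the original, this only splits where the terminator is followed by whitespace, and it still fails on abbreviations missing from the set or on an abbreviation that happens to end a sentence.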
