How to tweak the NLTK sentence tokenizer

抹茶落季 2020-12-02 08:13

I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Di

4 answers
  •  南方客
    南方客 (original poster)
    2020-12-02 08:54

    You need to supply a list of abbreviations to the tokenizer, like so:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    # Abbreviations are listed in lowercase, without the trailing period.
    punkt_param = PunktParameters()
    punkt_param.abbrev_types = {'dr', 'vs', 'mr', 'mrs', 'prof', 'inc'}
    sentence_splitter = PunktSentenceTokenizer(punkt_param)

    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)
    

    sentences is now:

    ['is THAT what you mean, Mrs. Hussey?']
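
The abbreviations above are written in lowercase with no trailing period, which, as far as I can tell, is the form Punkt compares tokens against. If your own list comes from somewhere else (say, typed as 'Dr.' or 'Prof.'), a small helper along these lines could tidy it up first. The name normalize_abbreviations is made up for this sketch and is not part of NLTK.

```python
def normalize_abbreviations(abbreviations):
    """Lowercase each abbreviation and drop one trailing period, if present.

    Hypothetical helper, not an NLTK function.
    """
    normalized = set()
    for abbr in abbreviations:
        abbr = abbr.strip().lower()
        if abbr.endswith('.'):
            abbr = abbr[:-1]  # remove only the final period, so 'u.s.' -> 'u.s'
        if abbr:
            normalized.add(abbr)
    return normalized


abbrev_types = normalize_abbreviations(
    ['Dr.', 'vs', 'Mr.', 'Mrs.', 'Prof.', 'Inc.', 'U.S.']
)
```

The resulting set can then be assigned to punkt_param.abbrev_types in place of the hand-written one.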
    

    Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). A quick-and-dirty workaround is to insert a space before any apostrophe or quotation mark that directly follows a sentence-ending symbol (., ! or ?). The line below covers the double-quote case:

    text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
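
If you also want to cover apostrophes, chaining more replace calls gets unwieldy. One possible alternative is a single regular expression. This is only a sketch: the function name is invented here, and it assumes that every quote character directly after ., ! or ? should be pushed away by a space, which may be too aggressive for some texts.

```python
import re

# Sentence-ending punctuation immediately followed by a single or double quote.
_END_THEN_QUOTE = re.compile(r'([.!?])(["\'])')


def space_before_closing_quotes(text):
    """Insert a space between ., ! or ? and a quote character right after it."""
    return _END_THEN_QUOTE.sub(r'\1 \2', text)


sample = '"Is THAT what you mean, Mrs. Hussey?" he asked. \'Yes!\' she said.'
spaced = space_before_closing_quotes(sample)
```

Run the text through this before passing it to sentence_splitter.tokenize. Note that it changes the text itself, so the extra spaces will show up in the resulting sentences.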
    
