How to tweak the NLTK sentence tokenizer

抹茶落季 2020-12-02 08:13

I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Di

4 Answers
  •  时光说笑
    2020-12-02 09:15

    You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:

    import nltk

    # Abbreviations to add: no final period, internal periods kept
    extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
    

    Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
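
    If your abbreviation list comes from somewhere that writes entries like 'Dr.' or 'i.e.', a small helper can put them into the form described above before they are added. This is only a sketch: the names normalize_abbreviations and add_abbreviations are hypothetical helpers, not part of NLTK, and lowercasing is an assumption based on the lowercase entries in the example.

```python
def normalize_abbreviations(abbreviations):
    """Return the abbreviations lowercased, without a final period.

    Internal periods are kept, so 'i.e.' becomes 'i.e'.
    Lowercasing is an assumption, matching the lowercase entries above.
    """
    normalized = set()
    for abbr in abbreviations:
        # Trim whitespace, lowercase, then drop any trailing period(s).
        cleaned = abbr.strip().lower().rstrip(".")
        if cleaned:  # skip entries that were empty or only periods
            normalized.add(cleaned)
    return normalized


def add_abbreviations(tokenizer, abbreviations):
    """Hypothetical helper: add normalized abbreviations to a loaded tokenizer.

    Relies on the same _params.abbrev_types set used in the answer above.
    """
    tokenizer._params.abbrev_types.update(normalize_abbreviations(abbreviations))
    return tokenizer
```

    With the tokenizer loaded as in the example above, you would call add_abbreviations(sentence_tokenizer, ['Dr.', 'Prof.', 'i.e.']) and then sentence_tokenizer.tokenize(text) as usual.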
