I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Di
You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:
import nltk

extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
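To see the effect of the abbreviation set in isolation, here is a minimal sketch that builds an untrained PunktSentenceTokenizer from a PunktParameters object instead of loading the pre-trained model, so it should not need the punkt data download. The sample text and the variable names are made up for illustration, and an untrained tokenizer will generally split less accurately than the pre-trained one.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Illustrative sample text (not from the question).
text = "Dr. Smith met Mr. Jones. They talked."

# Baseline: an untrained tokenizer with no known abbreviations.
baseline_tokenizer = PunktSentenceTokenizer()
baseline_sentences = baseline_tokenizer.tokenize(text)

# Same tokenizer class, but with abbreviations registered up front.
# Entries are lowercase and omit the final period.
params = PunktParameters()
params.abbrev_types = {'dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e'}
custom_tokenizer = PunktSentenceTokenizer(params)
custom_sentences = custom_tokenizer.tokenize(text)

print("baseline:", baseline_sentences)
print("custom:  ", custom_sentences)
```

Comparing the two printed lists shows whether the periods after "Dr" and "Mr" are still treated as sentence boundaries once those abbreviations are registered.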