How to tweak the NLTK sentence tokenizer

抹茶落季 2020-12-02 08:13

I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Di

4 answers

时光说笑 (original poster)

2020-12-02 09:15

You can make NLTK's pre-trained English sentence tokenizer recognize more abbreviations by adding them to the set _params.abbrev_types. For example:

    import nltk

    # Abbreviations are lowercase and written without the trailing period.
    extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
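
To see the effect in isolation, here is a minimal sketch that applies the same abbrev_types idea to an untrained Punkt tokenizer, so it runs without downloading the pre-trained model. The helper name build_tokenizer and the sample sentence are made up for illustration, and an untrained tokenizer will not split text exactly the way the pre-trained English model does.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer


def build_tokenizer(extra_abbreviations):
    """Return an untrained Punkt tokenizer that knows the given abbreviations.

    Hypothetical helper for illustration only; entries must be lowercase
    and must omit the final period (e.g. 'dr', 'i.e').
    """
    params = PunktParameters()
    params.abbrev_types.update(extra_abbreviations)
    # Passing a PunktParameters object makes the tokenizer use it directly
    # instead of training on raw text.
    return PunktSentenceTokenizer(params)


text = "Dr. Smith met Mr. Jones. They talked."

# Without any known abbreviations, every period is treated as a sentence end.
baseline = build_tokenizer([]).tokenize(text)

# With 'dr' and 'mr' registered, the periods after them no longer split.
sentences = build_tokenizer(['dr', 'mr']).tokenize(text)

print(baseline)
print(sentences)
```

The same update call works on the loaded pre-trained tokenizer; the only difference is that its _params object already carries the abbreviations, collocations, and sentence starters learned during training.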
