How to tweak the NLTK sentence tokenizer

抹茶落季 2020-12-02 08:13

I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Di

4 answers

南方客 (original poster)

2020-12-02 08:54

You need to supply a list of abbreviations to the tokenizer, like so:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)

    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)

sentences is now:

['is THAT what you mean, Mrs. Hussey?']

Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
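
Note that the chained replace calls only cover double quotes. If you also want to catch apostrophes, a regular expression can handle both in one pass. This is only a rough sketch using the standard re module; the helper name space_out_closing_quotes is made up for illustration and is not part of NLTK. It will also split possessives written after an abbreviation (for example U.S.'s), so check it against your own texts before relying on it.

```python
import re

# Sentence-ending punctuation (. ! ?) followed directly by a single or double quote.
_CLOSING_QUOTE = re.compile(r"([.!?])(['\"])")

def space_out_closing_quotes(text):
    """Insert a space between sentence-ending punctuation and a quote that follows it.

    Hypothetical helper: a regex version of the chained str.replace workaround.
    """
    return _CLOSING_QUOTE.sub(r"\1 \2", text)

sample = "\"Is THAT what you mean, Mrs. Hussey?\" he asked. 'Yes!' she said."
print(space_out_closing_quotes(sample))
```

You would run the text through this helper first and then pass the result to sentence_splitter.tokenize(text) as before.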
