How to tweak the NLTK sentence tokenizer

抹茶落季 2020-12-02 08:13

I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Di

4 answers
  •  感情败类
    2020-12-02 09:01

    So I had a similar issue and tried out vpekar's solution above.

    Perhaps mine is some sort of edge case, but I observed the same behavior after applying the replacements. However, when I instead moved the closing quotation mark in front of the punctuation, I got the output I was looking for. Presumably strict adherence to MLA style matters less than keeping the original quote together as a single sentence.

    To be clearer:

    text = text.replace('?"', '"?').replace('!"', '"!').replace('."', '".')
    

    If MLA style is important, though, you could always go back and reverse these changes after tokenizing, wherever it counts.
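As a rough sketch of that round trip: swap the quote and punctuation, tokenize, then swap them back in each sentence. The helper names below are my own and not part of NLTK, and the tokenizer is passed in as an argument so that something like `nltk.tokenize.sent_tokenize` can be plugged in. One caveat: the restore step also rewrites any `".`, `"!` or `"?` that was already in the source text, so treat it as an approximation.

```python
# Hypothetical helpers (not part of NLTK) illustrating the swap-and-restore idea.

# (original, swapped) pairs, matching the replacements shown above.
_SWAPS = [('?"', '"?'), ('!"', '"!'), ('."', '".')]


def move_quotes_before_punct(text):
    """Move a closing quote in front of the sentence-ending punctuation."""
    for original, swapped in _SWAPS:
        text = text.replace(original, swapped)
    return text


def restore_punct_before_quotes(text):
    """Undo move_quotes_before_punct (also affects pre-existing '".' etc.)."""
    for original, swapped in _SWAPS:
        text = text.replace(swapped, original)
    return text


def tokenize_keeping_quotes(text, sent_tokenize):
    """Tokenize the swapped text, then restore the original punctuation order.

    sent_tokenize is any callable mapping a string to a list of sentences,
    for example nltk.tokenize.sent_tokenize (which needs the Punkt model data).
    """
    swapped = move_quotes_before_punct(text)
    return [restore_punct_before_quotes(s) for s in sent_tokenize(swapped)]
```

With NLTK installed, this would be called as `tokenize_keeping_quotes(text, nltk.tokenize.sent_tokenize)`. Whether the quotes then stay in one sentence still depends on how Punkt handles the particular text.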
