nltk tokenization and contractions

Asked by 执念已碎 on 2021-02-19 01:14

I'm tokenizing text with NLTK, just sentences fed to wordpunct_tokenizer. This splits contractions (e.g. "don't" into "don" + "'" + "t"), but I want to keep them as one word.

3 Answers
  • 2021-02-19 01:35

    Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
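To see concretely why wordpunct_tokenizer splits contractions: it is essentially the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of non-space punctuation). A minimal reproduction of that behavior without installing NLTK:

```python
import re

# The same pattern NLTK's wordpunct_tokenize is built on:
# runs of word characters, or runs of punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_tokenize(text):
    return WORDPUNCT.findall(text)

print(wordpunct_tokenize("I don't like it."))
# ['I', 'don', "'", 't', 'like', 'it', '.']
```

Since `'` is neither a word character nor whitespace, it always becomes its own token, which is exactly the contraction-splitting the question describes.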

  • 2021-02-19 01:39

Because the number of contractions is fairly small, one way to do it is to search for each contraction and replace it with its full equivalent (e.g. "don't" becomes "do not"), and then feed the updated sentences into wordpunct_tokenizer.
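A minimal sketch of that approach. The contraction table here is deliberately tiny and hypothetical; a real one would need many more entries, and the tokenizer is a stand-in regex equivalent to wordpunct behavior:

```python
import re

# Partial, illustrative contraction table (a real list would be much longer).
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}

# One alternation pattern over all known contractions, case-insensitive.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its full form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def wordpunct(text):
    # Same regex wordpunct-style tokenization is based on.
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct(expand_contractions("I don't think it's broken")))
# ['I', 'do', 'not', 'think', 'it', 'is', 'broken']
```

One caveat with this approach: expansion changes the surface text, so the tokens no longer match the original sentence, which matters if you need character offsets or want to reconstruct the input later.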

  • 2021-02-19 02:02

    I've worked with NLTK before on this project. When I did, I found that contractions were useful to consider.

However, I did not write a custom tokenizer; I simply handled the split pieces after POS tagging.

I suspect this is not the answer that you are looking for, but I hope it helps somewhat.
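One way to "handle it afterwards", as this answer suggests, is to re-join the three pieces a wordpunct-style tokenizer produces for a contraction. The merge rule below (word, apostrophe, word) is an assumption about what post-processing looks like, not the answerer's actual code:

```python
def merge_contractions(tokens):
    """Re-join sequences like ['don', "'", 't'] back into "don't"."""
    merged = []
    i = 0
    while i < len(tokens):
        # A contraction split looks like: alphabetic, "'", alphabetic.
        if (i + 2 < len(tokens)
                and tokens[i + 1] == "'"
                and tokens[i].isalpha()
                and tokens[i + 2].isalpha()):
            merged.append(tokens[i] + "'" + tokens[i + 2])
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_contractions(['I', 'don', "'", 't', 'like', 'it', '.']))
# ['I', "don't", 'like', 'it', '.']
```

Note this also re-joins possessives like "John" + "'" + "s", which may or may not be what you want depending on the downstream task.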
