nltk tokenization and contractions

执念已碎 2021-02-19 01:14

I'm tokenizing text with nltk, just sentences fed to wordpunct_tokenizer. This splits contractions (e.g. 'don't' to 'don' + " ' " + 't') but I want to keep them as one w

  • 2021-02-19 01:35

    Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then run your text through each NLTK tokenizer to see how it behaves.
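
    As a rough way to compare behaviours on your own text, here is a minimal sketch using only the standard library. The first pattern is the regular expression that NLTK's WordPunctTokenizer is built on; the second, CONTRACTION_AWARE, is a hypothetical alternative written for this example (not an NLTK tokenizer) that keeps an ASCII apostrophe when it sits between word characters.

```python
import re

# The regular expression behind NLTK's WordPunctTokenizer:
# runs of word characters, or runs of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical alternative for this example: a word may contain
# apostrophes as long as word characters follow each one.
CONTRACTION_AWARE = re.compile(r"\w+(?:'\w+)*|[^\w\s]+")


def tokenize(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


sentence = "I don't think it's broken."
wordpunct_tokens = tokenize(WORDPUNCT, sentence)
contraction_tokens = tokenize(CONTRACTION_AWARE, sentence)

print(wordpunct_tokens)    # contractions split into three tokens
print(contraction_tokens)  # contractions kept whole
```

    The second pattern should also work if passed to nltk.tokenize.RegexpTokenizer, since it uses only a non-capturing group. It does not handle typographic apostrophes (’), and it treats possessives such as "John's" the same way as contractions.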

  • 2021-02-19 01:39

    Because the number of contractions is small, one way to do it is to search for each contraction and replace it with its full equivalent (e.g. "don't" becomes "do not"), and then feed the updated sentences into the wordpunct_tokenizer.
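
    A minimal sketch of that search-and-replace step, assuming a small hand-written mapping. The CONTRACTIONS table and the expand_contractions helper are illustrative names, not part of NLTK, and the table is far from complete. The final tokenization uses the same regular expression that WordPunctTokenizer is built on, so the example runs without NLTK installed.

```python
import re

# Illustrative, non-exhaustive mapping (an assumption for this sketch).
# Some entries are ambiguous: "it's" can also mean "it has".
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation over all known contractions, matched case-insensitively.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Keep a leading capital, e.g. "Don't" -> "Do not".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _CONTRACTION_RE.sub(replace, text)


expanded = expand_contractions("I'm sure they don't mind.")
# Same pattern as NLTK's WordPunctTokenizer.
tokens = re.findall(r"\w+|[^\w\s]+", expanded)
print(expanded)
print(tokens)
```

    Note that this changes the text itself, so it only fits if the later steps do not need the original surface forms.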

  • 2021-02-19 02:02

    I've worked with NLTK before on this project. When I did, I found that contractions were useful to consider.

    However, I did not write a custom tokenizer; I simply handled it after POS tagging.

    I suspect this is not the answer that you are looking for, but I hope it helps somewhat.
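
    The answer does not say how the contractions were handled afterwards. One possible post-processing step, shown here purely as an assumption, is to re-join the three-token pattern (word, apostrophe, suffix) that wordpunct-style tokenization produces. The merge_contractions helper and the CONTRACTION_SUFFIXES set are hypothetical names for this sketch.

```python
# Suffixes that commonly follow the apostrophe in English contractions.
# This set is an assumption for the sketch and can be adjusted.
CONTRACTION_SUFFIXES = {"t", "s", "re", "ve", "ll", "d", "m"}


def merge_contractions(tokens):
    """Re-join (word, "'", suffix) triples into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        is_contraction = (
            i + 2 < len(tokens)
            and tokens[i].isalpha()
            and tokens[i + 1] == "'"
            and tokens[i + 2].lower() in CONTRACTION_SUFFIXES
        )
        if is_contraction:
            merged.append(tokens[i] + "'" + tokens[i + 2])
            i += 3  # skip the apostrophe and the suffix
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Token list as a wordpunct-style tokenizer would produce it.
split_tokens = ["I", "don", "'", "t", "think", "it", "'", "s", "broken", "."]
merged_tokens = merge_contractions(split_tokens)
print(merged_tokens)
```

    This also merges possessives such as "John's", and it can misfire when a single-quoted span happens to start with one of the suffixes, so it is a heuristic rather than a complete solution.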
