Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would separate the contraction into its component words.
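As a rough illustration of the goal, here is a minimal regex sketch (mine, not from the question): it peels common English clitics such as "n't", "'ve", and "'ll" off the end of each token. The function name `split_contractions` and the clitic list are assumptions for illustration only.

    >>> import re
    >>> CLITICS = re.compile(r"(n't|'ve|'ll|'re|'d|'s|'m)$", re.IGNORECASE)
    >>> def split_contractions(text):
    ...     tokens = []
    ...     for word in re.findall(r"[\w']+|[^\w\s]", text):
    ...         parts = []
    ...         # repeatedly strip trailing clitics, so "wouldn't've"
    ...         # yields ["would", "n't", "'ve"]
    ...         while True:
    ...             m = CLITICS.search(word)
    ...             if not m:
    ...                 break
    ...             parts.insert(0, m.group(0))
    ...             word = word[:m.start()]
    ...         if word:
    ...             parts.insert(0, word)
    ...         tokens.extend(parts)
    ...     return tokens
    ...
    >>> split_contractions("I wouldn't've done that.")
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']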

5 Answers
  •  长发绾君心
    2021-01-20 22:36

    >>> import nltk
    >>> nltk.word_tokenize("I wouldn't've done that.")
    ['I', "wouldn't", "'ve", 'done', 'that', '.']
    

    A single pass leaves "wouldn't" unsplit, so run word_tokenize a second time on each token and flatten the result:

    >>> from itertools import chain
    >>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
    [['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
    >>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']
    
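    For convenience, the two tokenization passes and the flattening step can be wrapped in a small helper; the function name `tokenize_contractions` is mine, not from the answer:

    >>> from itertools import chain
    >>> import nltk
    >>> def tokenize_contractions(text):
    ...     # first pass splits the sentence; second pass splits any
    ...     # remaining contraction inside each token, then flatten
    ...     return list(chain.from_iterable(
    ...         nltk.word_tokenize(tok) for tok in nltk.word_tokenize(text)))
    ...
    >>> tokenize_contractions("I wouldn't've done that.")
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']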
