Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would separate the contraction into its component words.
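As a rough illustration of the goal, here is a minimal regex sketch (mine, not from the question): it peels common English clitics such as "n't", "'ve", and "'ll" off the end of each token. The function name `split_contractions` and the clitic list are assumptions for illustration only.

    >>> import re
    >>> CLITICS = re.compile(r"(n't|'ve|'ll|'re|'d|'s|'m)$", re.IGNORECASE)
    >>> def split_contractions(text):
    ...     tokens = []
    ...     for word in re.findall(r"[\w']+|[^\w\s]", text):
    ...         parts = []
    ...         # repeatedly strip trailing clitics, so "wouldn't've"
    ...         # yields ["would", "n't", "'ve"]
    ...         while True:
    ...             m = CLITICS.search(word)
    ...             if not m:
    ...                 break
    ...             parts.insert(0, m.group(0))
    ...             word = word[:m.start()]
    ...         if word:
    ...             parts.insert(0, word)
    ...         tokens.extend(parts)
    ...     return tokens
    ...
    >>> split_contractions("I wouldn't've done that.")
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']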

5 Answers
  •  长发绾君心
    2021-01-20 22:36

    >>> import nltk
    >>> nltk.word_tokenize("I wouldn't've done that.")
    ['I', "wouldn't", "'ve", 'done', 'that', '.']
    

    A single pass leaves "wouldn't" unsplit, so run word_tokenize a second time on each token and flatten the result:

    >>> from itertools import chain
    >>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
    [['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
    >>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']
    
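    For convenience, the two tokenization passes and the flattening step can be wrapped in a small helper; the function name `tokenize_contractions` is mine, not from the answer:

    >>> from itertools import chain
    >>> import nltk
    >>> def tokenize_contractions(text):
    ...     # first pass splits the sentence; second pass splits any
    ...     # remaining contraction inside each token, then flatten
    ...     return list(chain.from_iterable(
    ...         nltk.word_tokenize(tok) for tok in nltk.word_tokenize(text)))
    ...
    >>> tokenize_contractions("I wouldn't've done that.")
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']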
