Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" wou

5 answers
  •  北恋 (original poster)
    2021-01-20 22:39

    You can use this regex to tokenize the text:

    (?:(?!.')\w)+|\w?'\w+|[^\s\w]
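
To make the three branches easier to read, here is the same pattern spelled out with re.VERBOSE. This is only a sketch: the names TOKEN_RE and tokenize are mine, not part of the answer, and the comments describe how I read each branch.

```python
import re

# Same pattern as above, split across lines so each branch can carry a comment.
# TOKEN_RE and tokenize are illustrative names, not from the original answer.
TOKEN_RE = re.compile(
    r"""
      (?:(?!.')\w)+   # word characters, stopping before the one that precedes an apostrophe
    | \w?'\w+         # optional word character, an apostrophe, then letters: n't, 've
    | [^\s\w]         # one character that is neither whitespace nor a word character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return every token matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


print(tokenize("shouldn't"))
```

The negative lookahead `(?!.')` is what leaves the final "n" of "should" + "n't" for the second branch to pick up.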
    

    Usage:

    >>> import re
    >>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']
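
One caveat worth checking before relying on this: the pattern always keeps the character before the apostrophe together with what follows it. That suits "n't", but as far as I can tell it returns "I'm" as a single token and splits "it's" into 'i' and "t's". Below is a possible variant, offered as a sketch rather than a definitive tokenizer. It assumes you want "n't" split off as a unit and every other clitic split at the apostrophe; ALT_TOKEN_RE and the sample sentence are my own.

```python
import re

# The pattern from the answer above, unchanged.
ORIGINAL_RE = re.compile(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]")

# Hypothetical variant: treat n't specially, split other clitics at the apostrophe.
ALT_TOKEN_RE = re.compile(
    r"""
      n't\b            # the negative clitic itself
    | '\w+             # other clitics: 's, 'm, 've, 'll, 'd, 're
    | \w+?(?=n't\b)    # a word stem directly before n't
    | \w+              # any other word
    | [^\s\w]          # a single punctuation character
    """,
    re.VERBOSE,
)

print(ORIGINAL_RE.findall("it's"))   # ['i', "t's"]
print(ORIGINAL_RE.findall("I'm"))    # ["I'm"]
print(ALT_TOKEN_RE.findall("It's what I'm saying, isn't it?"))
# ['It', "'s", 'what', 'I', "'m", 'saying', ',', 'is', "n't", 'it', '?']
```

Neither pattern tells a contraction apostrophe apart from a single quotation mark, so text quoted with '...' would need separate handling.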
    
