Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" wou

5 Answers

北恋 (original poster)

2021-01-20 22:39

You can use this regex to tokenize the text:

(?:(?!.')\w)+|\w?'\w+|[^\s\w]
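
Read one alternative at a time, the pattern can be spelled out with `re.VERBOSE`. This is only an annotated sketch of the same regex; the name `TOKEN_RE` is mine and the comments are my reading of each branch.

```python
import re

# TOKEN_RE is an illustrative name; the pattern itself is the one given above.
TOKEN_RE = re.compile(
    r"""
    (?:(?!.')\w)+   # word characters, stopping one character before an apostrophe
  | \w?'\w+         # optional word character, an apostrophe, then word characters
  | [^\s\w]         # any single character that is neither whitespace nor a word character
    """,
    re.VERBOSE,
)

print(TOKEN_RE.findall("shouldn't"))  # ['should', "n't"]
```

The negative lookahead `(?!.')` is what leaves the final "n" of "should" for the second branch, so it ends up in the "n't" token.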

Usage:

>>> import re
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
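
One caveat: the pattern always hands the character before an apostrophe to the contraction token. That is what you want for "n't", but it also splits "he's" into "h" and "e's". If that matters, a possible variant is to special-case "n't" only. The sketch below is my own assumption about the desired behavior, not part of the pattern above; `NT_TOKEN_RE` and `tokenize` are hypothetical names.

```python
import re

# The pattern from the answer, kept here for comparison.
ORIGINAL = r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]"

# Hypothetical variant: stop a word only where "n't" begins, then match
# "n't", any other apostrophe suffix ('ve, 's, 'm), or a lone symbol.
NT_TOKEN_RE = re.compile(r"(?:(?!n't)\w)+|n't|'\w+|[^\s\w]", re.IGNORECASE)

def tokenize(text):
    """Return word, contraction, and punctuation tokens from text."""
    return NT_TOKEN_RE.findall(text)

print(re.findall(ORIGINAL, "he's"))          # ['h', "e's"]
print(tokenize("he's"))                      # ['he', "'s"]
print(tokenize("I wouldn't've done that."))  # ['I', 'would', "n't", "'ve", 'done', 'that', '.']
```

Both versions only recognize the straight ASCII apostrophe; typographic apostrophes would need to be added to the pattern or normalized first.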
