I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ['should', "n't"].
You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
Usage:
>>> import re
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
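
If it helps, here is a minimal sketch that wraps the same pattern in a reusable helper and spells out the three alternatives with re.VERBOSE; the names TOKEN_RE and tokenize are just illustrative, not part of any library:

import re

# Same pattern as above, written out with comments (names are illustrative).
TOKEN_RE = re.compile(r"""
      (?:(?!.')\w)+   # run of word chars, stopping before the char that precedes an apostrophe
    | \w?'\w+         # contraction piece such as n't or 've (optionally led by one word char)
    | [^\s\w]         # any single non-space, non-word character, e.g. punctuation
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("I wouldn't've done that."))
# ['I', 'would', "n't", "'ve", 'done', 'that', '.']
print(tokenize("She shouldn't go."))
# ['She', 'should', "n't", 'go', '.']

The first alternative eats ordinary word characters but stops one character before an apostrophe, so the contraction suffix (n't, 've, 'll, ...) is left for the second alternative to pick up as its own token; the third alternative catches punctuation one character at a time.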