I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.
Furthermore, I need for the
A different non-regex approach from the others:
>>> import string
>>> from itertools import groupby
>>>
>>> special = set(string.punctuation + string.whitespace)
>>> s = "One two three tab\ttabandspace\t end"
>>>
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
>>> split_combined
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
>>> split_separated
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
Could use dict.fromkeys and .get instead of the lambda, I guess.
[edit]
Some explanation:
groupby accepts two arguments, an iterable and an (optional) keyfunction. It loops through the iterable and groups them with the value of the keyfunction:
>>> groupby("sentence", lambda c: c in 'nt')
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
As @JonClements guessed, what I had in mind was
>>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
>>> s = "One two three tab\ttabandspace\t end"
>>> [''.join(g) for k,g in groupby(s, special.get)]
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
for the case where we were combining the separators. .get returns None if the value isn't in the dict.