Efficiently split a string using multiple separators and retaining each separator?

野趣味 2021-02-02 10:44

I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, I need for the

9 Answers
  •  感动是毒
    2021-02-02 11:30

    A different non-regex approach from the others:

    >>> import string
    >>> from itertools import groupby
    >>> 
    >>> special = set(string.punctuation + string.whitespace)
    >>> s = "One two  three    tab\ttabandspace\t end"
    >>> 
    >>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
    >>> split_combined
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']
    >>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
    >>> split_separated
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
    

    Could use dict.fromkeys and .get instead of the lambda, I guess.

    [edit]

    Some explanation:

    groupby accepts two arguments: an iterable and an (optional) key function. It loops through the iterable and groups consecutive items by the value the key function returns for them:

    >>> groupby("sentence", lambda c: c in 'nt')
    <itertools.groupby object at 0x...>
    >>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
    [(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
    

    where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
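To make that pitfall concrete, here is a small sketch. The word list and the `first_letter` key are made up for illustration and are not part of the original answer. Without sorting, the same key shows up in several groups; sorting by the key function first makes equal keys adjacent.

```python
from itertools import groupby

# Made-up sample data, purely for illustration.
words = ["apple", "bob", "avocado", "banana", "cherry"]

def first_letter(word):
    return word[0]

# Unsorted input: groupby starts a new group every time the key changes,
# so 'a' and 'b' each appear as two separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorting by the same key first puts equal keys next to each other.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), first_letter)
]

print(unsorted_groups)
print(sorted_groups)
```

For the splitting problem here, grouping only contiguous runs is exactly the behaviour we want, so no sorting is needed.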

    As @JonClements guessed, what I had in mind was

    >>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
    >>> s = "One two  three    tab\ttabandspace\t end"
    >>> [''.join(g) for k,g in groupby(s, special.get)]
    ['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']
    

    for the case where we were combining the separators. .get returns None if the key isn't in the dict, so every non-separator character gets the key None and runs of them are grouped together.
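The same `.get` trick can probably be extended to the case where different separators are kept apart. This is a sketch, assuming a second dict that maps each separator to itself. The helper name `split_keep` and its `combine` flag are hypothetical and not from the original answer.

```python
import string
from itertools import groupby

SEPARATORS = string.punctuation + string.whitespace

# Every separator maps to True: any run of adjacent separators shares one key.
_combined = dict.fromkeys(SEPARATORS, True)
# Every separator maps to itself: only runs of the *same* separator share a key.
_separated = {c: c for c in SEPARATORS}

def split_keep(s, combine=True):
    """Split s on punctuation/whitespace, keeping the separators (hypothetical helper)."""
    key = (_combined if combine else _separated).get
    # Ordinary characters are missing from both dicts, so .get returns None
    # for them and consecutive ordinary characters end up in one group.
    return [''.join(g) for _, g in groupby(s, key)]

s = "One two  three    tab\ttabandspace\t end"
print(split_keep(s))
print(split_keep(s, combine=False))
```

Because nothing is dropped, joining the pieces back together with `''.join(...)` should reproduce the original string in either mode.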
