RegExp match repeated characters

陌清茗 2020-11-30 03:47

For example, I have the string:

 aacbbbqq

As a result, I want the following matches:

 (aa, c, bbb, qq)  
6 Answers
  •  野趣味 (original poster)
    2020-11-30 04:35

    Generally

    The trick is to match a single char of the range you want, and then make sure you match all repetitions of the same character:

    >>> import re
    >>> matcher = re.compile(r'(.)\1*')
    

    This matches any single character except a newline (.) and then its repetitions (\1*), if any.

    For your input string, you can get the desired output as:

    >>> [match.group() for match in matcher.finditer('aacbbbqq')]
    ['aa', 'c', 'bbb', 'qq']
    

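    NB: because the pattern contains a capture group, re.findall returns only the contents of that group (the single captured character) rather than the whole run, so it does not give the desired output here.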
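
    To make the re.findall caveat concrete, here is a minimal sketch for Python 3. The outer-group pattern is one possible workaround, not something the answer itself proposes, and the variable names are only illustrative:

```python
import re

text = 'aacbbbqq'

# With a single capture group, findall returns that group's contents,
# i.e. only the first character of each run.
first_chars = re.findall(r'(.)\1*', text)

# Possible workaround (an assumption, not from the answer): wrap the whole
# run in an outer group. The repeated character is then group 2, so the
# backreference becomes \2, and findall returns (run, char) tuples.
runs = [whole for whole, _char in re.findall(r'((.)\2*)', text)]

print(first_chars)
print(runs)
```

    The finditer approach shown above avoids the extra group entirely, which is probably why the answer prefers it.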

    Other ranges

    If you don't want to match just any character, change the . in the regular expression accordingly:

    >>> matcher = re.compile(r'([a-z])\1*')      # lowercase ASCII letters only
    >>> matcher = re.compile(r'(?i)([a-z])\1*')  # ASCII letters, ignoring case (a run may mix cases, e.g. 'aA')
    >>> matcher = re.compile(r'(\w)\1*')         # Python 2: ASCII letters, digits, underscore; Python 3 str patterns are Unicode-aware by default
    >>> matcher = re.compile(r'(?u)(\w)\1*')     # Unicode-aware: any letter or digit known to Unicode, or underscore
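
    As a rough illustration of how these character ranges change the result, the sketch below runs three of the patterns over a made-up sample string (the sample and the variable names are assumptions for illustration; Python 3 is assumed):

```python
import re

sample = 'aAbb11__'  # hypothetical input mixing case, digits and underscores

# Lowercase ASCII letters only: 'A', the digits and underscores are skipped.
lower_only = [m.group() for m in re.finditer(r'([a-z])\1*', sample)]

# Case-insensitive: the backreference also ignores case, so 'aA' is one run.
any_case = [m.group() for m in re.finditer(r'(?i)([a-z])\1*', sample)]

# Word characters: letters, digits and underscores each form their own runs.
word_chars = [m.group() for m in re.finditer(r'(\w)\1*', sample)]

print(lower_only)
print(any_case)
print(word_chars)
```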
    

    Check the latter against u'hello=²²' (Python 2.x) or 'hello=²²' (Python 3.x); the '=' is not a word character, so it is skipped:

    >>> text= u'hello=\xb2\xb2'
    >>> print('\n'.join(match.group() for match in matcher.finditer(text)))
    h
    e
    ll
    o
    ²²
    

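    For non-Unicode strings in Python 2 and bytes patterns in Python 3, what \w matches depends on the current locale only when the pattern is compiled with the re.LOCALE flag; in that case, an earlier locale.setlocale call can change the result.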
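
    For comparison, a small sketch of how the flags affect \w on the same test string in Python 3, where str patterns are Unicode-aware unless re.ASCII is passed (the variable names are illustrative):

```python
import re

text = 'hello=\xb2\xb2'  # 'hello=²²'

# Default for str patterns in Python 3: \w is Unicode-aware,
# so the superscript twos count as word characters.
unicode_runs = [m.group() for m in re.finditer(r'(\w)\1*', text)]

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so '²²' is skipped.
ascii_runs = [m.group() for m in re.finditer(r'(\w)\1*', text, flags=re.ASCII)]

print(unicode_runs)
print(ascii_runs)
```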
