RegExp match repeated characters

陌清茗 2020-11-30 03:47

For example, I have the string:

 aacbbbqq

As a result, I want to get the following matches:

 (aa, c, bbb, qq)  
6 answers
  • 2020-11-30 04:27

    The findall method will work if you capture the back-reference like so:

    import re
    result = [match[0] + match[1] for match in re.findall(r"(.)(\1*)", string)]
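
As a minimal, self-contained sketch of the line above (the function name `runs_with_findall` is made up for illustration): with two capture groups, `re.findall` returns one `(first_char, repeats)` tuple per run, and joining the two parts rebuilds the run.

```python
import re

def runs_with_findall(text):
    # Group 1 captures the first character of a run, group 2 captures the
    # repeats that follow it; findall yields one (char, repeats) tuple per run.
    return [first + rest for first, rest in re.findall(r"(.)(\1*)", text)]

print(runs_with_findall("aacbbbqq"))  # ['aa', 'c', 'bbb', 'qq']
```

Note that `.` does not match a newline unless `re.DOTALL` is set, so newlines would be skipped by this pattern.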
    
  • 2020-11-30 04:29

    itertools.groupby is not a RegExp, but it's not self-written either. :-) A quote from the Python docs:

    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
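
Applied to the string from the question, the same idea might look like this (a sketch; `runs_with_groupby` is a hypothetical helper name). `groupby` groups consecutive equal items, so each group is exactly one run of a repeated character.

```python
from itertools import groupby

def runs_with_groupby(text):
    # groupby yields (key, group_iterator) pairs for consecutive equal items;
    # joining each group turns it back into a substring.
    return ["".join(group) for _, group in groupby(text)]

print(runs_with_groupby("aacbbbqq"))  # ['aa', 'c', 'bbb', 'qq']
```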
    
  • 2020-11-30 04:35

    Generally

    The trick is to match a single char of the range you want, and then make sure you match all repetitions of the same character:

    >>> import re
    >>> matcher = re.compile(r'(.)\1*')
    

    This matches any single character (.) and then its repetitions (\1*) if any.

    For your input string, you can get the desired output as:

    >>> [match.group() for match in matcher.finditer('aacbbbqq')]
    ['aa', 'c', 'bbb', 'qq']
    

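    NB: because the pattern contains a capture group, re.findall returns only the contents of that group (the single character), not the whole match, so it won't give the desired result with this pattern.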
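
A short sketch of the `re.findall` behaviour noted above. The second pattern, with an extra outer group, is one possible workaround and is not part of the original answer; the variable names are illustrative.

```python
import re

text = "aacbbbqq"

# With one capture group, findall returns only what the group captured.
only_groups = re.findall(r"(.)\1*", text)
print(only_groups)  # ['a', 'c', 'b', 'q']

# Wrapping the whole run in an outer group (group 1) and referring back to
# the inner group (group 2) makes findall return (run, char) tuples.
whole_runs = [run for run, _ in re.findall(r"((.)\2*)", text)]
print(whole_runs)  # ['aa', 'c', 'bbb', 'qq']
```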

    Other ranges

    If you don't want to match just any character, change the . in the regular expression accordingly:

    >>> matcher = re.compile(r'([a-z])\1*') # only lowercase ASCII letters
    >>> matcher = re.compile(r'(?i)([a-z])\1*') # ASCII letters, ignoring case (so 'aA' counts as one run)
    >>> matcher = re.compile(r'(\w)\1*') # letters, digits or underscores (ASCII-only for Python 2 str, Unicode-aware for Python 3 str)
    >>> matcher = re.compile(r'(?u)(\w)\1*') # any letter or digit known to Unicode, or underscore ((?u) is already the default for Python 3 str patterns)
    

    Check the latter against u'hello=²²' (Python 2.x) or 'hello=²²' (Python 3.x); the = is not a word character, so it is skipped:

    >>> text= u'hello=\xb2\xb2'
    >>> print('\n'.join(match.group() for match in matcher.finditer(text)))
    h
    e
    ll
    o
    ²²
    

    What \w matches against non-Unicode strings / bytes can change if you compile the pattern with the re.LOCALE flag after having issued a locale.setlocale call.
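
To see how the narrower character classes from "Other ranges" behave, here is a small sketch (the helper `runs` and the sample strings are made up for illustration). Characters outside the class are simply skipped, and with `(?i)` the back-reference is also matched case-insensitively.

```python
import re

def runs(pattern, text):
    # Return the full text of every match, i.e. every run of repeats.
    return [m.group() for m in re.finditer(pattern, text)]

# Uppercase letters are outside [a-z], so 'BB' is skipped entirely.
lower_only = runs(r"([a-z])\1*", "aaBBcc")
print(lower_only)  # ['aa', 'cc']

# With (?i), 'a' followed by 'A' is treated as one run.
ignore_case = runs(r"(?i)([a-z])\1*", "aAbBc")
print(ignore_case)  # ['aA', 'bB', 'c']
```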

  • 2020-11-30 04:36

    This will work; see a working example here: http://www.rubular.com/r/ptdPuz0qDV

    (\w)\1*
    
  • 2020-11-30 04:39

    To collapse each run of a repeated character down to a single character, you can use:

    re.sub(r"(\w)\1*", r'\1', 'tessst')
    

    The output would be:

    'test'
    
  • 2020-11-30 04:45

    You can match that with: (\w)\1*
