Regex to remove repeated character pattern in a string

梦毁少年i 2020-12-14 22:30

I have a string that may have a repeated character pattern, e.g.

'xyzzyxxyzzyxxyzzyx'

I need to write a regex that would replace such str

4 Answers
  • 2020-12-14 23:18

    Here is a function, using the re module, that collapses every repeated substring to a single copy:

    import re

    def remove_duplications(string):
        # Replace each run of a repeated substring with one copy of it.
        return re.sub(r'(.+?)\1+', r'\1', string)
    
  • 2020-12-14 23:23

    Since you want the smallest repeating pattern, something like the following should work for you:

    re.sub(r'^(.+?)\1+$', r'\1', input_string)
    

    The ^ and $ anchors make sure you don't get matches in the middle of the string, and by using .+? instead of just .+ you will get the shortest pattern (compare results using a string like 'aaaaaaaaaa').
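To make that comparison concrete, here is a small sketch of what the anchors and the lazy quantifier each change. The variable names and the second test string 'xxabab' are only illustrative and not from the question.

```python
import re

s = 'a' * 10  # the 'aaaaaaaaaa' string suggested above

# Lazy group: the shortest unit that tiles the whole string.
lazy = re.sub(r'^(.+?)\1+$', r'\1', s)

# Greedy group: the longest unit that still appears at least twice.
greedy = re.sub(r'^(.+)\1+$', r'\1', s)

# Anchored vs. unanchored on a string that is not one repeated unit
# (illustrative input, not from the question).
mixed = 'xxabab'
anchored = re.sub(r'^(.+?)\1+$', r'\1', mixed)    # no match, unchanged
unanchored = re.sub(r'(.+?)\1+', r'\1', mixed)    # collapses 'xx' and 'abab'

print(lazy, greedy, anchored, unanchored)  # a aaaaa xxabab xab
```

Without the anchors the substitution rewrites repeats anywhere inside the string, which may or may not be what you want.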

  • 2020-12-14 23:23

    Try this regex pattern and capture the first group:

    ^(.+?)\1+$
    
    • ^ anchor for beginning of string/line
    • . any character except newlines
    • + quantifier meaning at least one occurrence
    • ? makes the + lazy instead of greedy, hence giving you the shortest pattern
    • () capturing group
    • \1+ backreference to group 1 with a quantifier, so the captured pattern must repeat at least once more
    • $ anchor for end of string/line

    Test it here: Rubular
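In Python, "capture the first group" could look roughly like the sketch below. The helper name smallest_unit is hypothetical, and returning the input unchanged when nothing repeats is an assumption about the desired behaviour.

```python
import re

# The pattern from this answer: shortest unit that tiles the whole string.
REPEAT = re.compile(r'^(.+?)\1+$')

def smallest_unit(text):
    """Return the smallest repeating unit of text, or text itself
    if it is not made of one unit repeated two or more times."""
    m = REPEAT.match(text)
    return m.group(1) if m else text

print(smallest_unit('xyzzyxxyzzyxxyzzyx'))  # xyzzyx
print(smallest_unit('abc'))                 # abc
```

Keep in mind that . does not match newlines by default, so multi-line input would need re.DOTALL.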


    The above solution does a lot of backtracking, which affects performance. If you know which characters are not allowed in these strings, you can use a negated character set instead, which can reduce the backtracking. For example, if whitespace is not allowed, use the pattern below; note that its + is greedy, so it captures the longest repeating unit rather than the shortest:

    ^([^\s]+)\1+$
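One thing worth checking before switching patterns: the greedy negated-set version does not always capture the same unit as the lazy one. A quick sketch (the helper unit and the constant names are hypothetical):

```python
import re

GREEDY_NO_WS = re.compile(r'^([^\s]+)\1+$')  # negated-set pattern from this answer
LAZY = re.compile(r'^(.+?)\1+$')             # lazy pattern from this answer

def unit(pattern, text):
    """Return the captured repeating unit, or None if there is no match."""
    m = pattern.match(text)
    return m.group(1) if m else None

# Three repetitions: both patterns agree.
print(unit(GREEDY_NO_WS, 'xyzzyxxyzzyxxyzzyx'))  # xyzzyx
print(unit(LAZY, 'xyzzyxxyzzyxxyzzyx'))          # xyzzyx

# Four repetitions: the greedy pattern captures a doubled unit.
print(unit(GREEDY_NO_WS, 'abababab'))  # abab
print(unit(LAZY, 'abababab'))          # ab
```

So the negated-set variant is only a drop-in replacement when the longest repeating unit is acceptable, or when the repetition count rules out a longer unit.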
    
  • 2020-12-14 23:27

    Use the following:

    >>> re.sub(r'(.+?)\1+', r'\1', 'xyzzyxxyzzyxxyzzyx')
    'xyzzyx'
    >>> re.sub(r'(.+?)\1+', r'\1', 'abcbaccbaabcbaccbaabcbaccba')
    'abcbaccba'
    >>> re.sub(r'(.+?)\1+', r'\1', 'iiiiiiiiiiiiiiiiii')
    'i'
    

    It matches a substring that repeats itself, (.+?)\1+, and replaces the whole match with a single copy of the repeating unit captured in the first group (\1). Also note that the reluctant quantifier +? can make the regex backtrack quite a lot.
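Because this pattern has no ^ and $ anchors, it also rewrites repeats that cover only part of the string. A minimal sketch of that behaviour (the function name collapse_repeats and both inputs are illustrative, not from the question):

```python
import re

def collapse_repeats(text):
    """Replace every run of a repeated substring with a single copy."""
    return re.sub(r'(.+?)\1+', r'\1', text)

# The repeated block is collapsed even though '!' follows it.
print(collapse_repeats('xyzzyxxyzzyxxyzzyx!'))  # xyzzyx!

# Ordinary doubled letters are collapsed too.
print(collapse_repeats('hello world'))          # helo world
```

If only strings that consist entirely of one repeated unit should change, the anchored form ^(.+?)\1+$ is the safer choice.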

