Python replace multiple strings while supporting backreferences

可紊 提交于 2021-01-27 07:59:04

问题


There are some nice ways to handle simultaneous multi-string replacement in python. However, I am having trouble creating an efficient function that can do that while also supporting backreferences.

What i would like is to use a dictionary of expression / replacement terms, where the replacement terms may contain backreferences to something matched by the expression.

e.g. (note the \1)

repdict = {'&&':'and', '||':'or', '!([a-zA-Z_])':'not \1'}

I put the SO answer mentioned at the outset into the function below, which works fine for expression / replacement pairs that don't contain backreferences:

def replaceAll(repdict, text):
    repdict = dict((re.escape(k), v) for k, v in repdict.items())
    pattern = re.compile("|".join(repdict.keys()))
    return pattern.sub(lambda m: repdict[re.escape(m.group(0))], text)

However, it doesn't work for the key that does contain a backreference..

>>> replaceAll(repldict, "!newData.exists() || newData.val().length == 1")
'!newData.exists() or newData.val().length == 1'

If i do it manually, it works fine. e.g.:

pattern = re.compile("!([a-zA-Z_])")
pattern.sub(r'not \1', '!newData.exists()')

Works as expected:

'not newData.exists()'

In the fancy function, the escaping seems to be messing up the key that uses the backref, so it never matches anything.

I eventually came up with this. However, note that the problem of supporting backrefs in the input parameters is not solved, i'm just handling it manually in the replacer function:

def replaceAll(repPat, text):
    def replacer(obj):
        match = obj.group(0)
        # manually deal with exclamation mark match..
        if match[:1] == "!": return 'not ' + match[1:]
        # here we naively escape the matched pattern into
        # the format of our dictionary key
        else: return repPat[naive_escaper(match)]

    pattern = re.compile("|".join(repPat.keys()))
    return pattern.sub(replacer, text)

def naive_escaper(string):
    if '=' in string: return string.replace('=', '\=')
    elif '|' in string: return string.replace('|', '\|')
    else: return string

# manually escaping \ and = works fine
repPat = {'!([a-zA-Z_])':'', '&&':'and', '\|\|':'or', '\=\=\=':'=='}
replaceAll(repPat, "(!this && !that) || !this && foo === bar")

Returns:

'(not this and not that) or not this'

So if anyone has an idea how to make a multi-string replacement function that supports backreferences and accepts the replacement terms as input, I'd appreciate your feedback very much.


回答1:


Update: See Angus Hollands' answer for a better alternative.


I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.

However, there are some difficulties. Let's assume a repldict like this:

repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}

If we combine these to a single regex, we get (a)|(b) - so now (b) is no longer group 1, which means its backreference won't work correctly.

Another problem is that we can't tell which replacement to use. If the regex matches the text b, how can we find out that \1b is the appropriate replacement? It's not possible; we don't have enough information.

The solution to these problems is to enclose every dict key in a named group like so:

(?P<group1>(a))|(?P<group2>(b))

Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group. so that \1b refers to "the first group after group2".


Here's the implementation:

def replaceAll(repldict, text):
    # split the dict into two lists because we need the order to be reliable
    keys, repls = zip(*repldict.items())

    # generate a regex pattern from the keys, putting each key in a named group
    # so that we can find out which one of them matched.
    # groups are named "_<idx>" where <idx> is the index of the corresponding
    # replacement text in the list above
    pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k in enumerate(keys))

    def repl(match):
        # find out which key matched. We know that exactly one of the keys has
        # matched, so it's the only named group with a value other than None.
        group_name = next(name for name, value in match.groupdict().items()
                          if value is not None)
        group_index = int(group_name[1:])

        # now that we know which group matched, we can retrieve the
        # corresponding replacement text
        repl_text = repls[group_index]

        # now we'll manually search for backreferences in the
        # replacement text and substitute them
        def repl_backreference(m):
            reference_index = int(m.group(1))

            # return the corresponding group's value from the original match
            # +1 because regex starts counting at 1
            return match.group(group_index + reference_index + 1)  

        return re.sub(r'\\(\d+)', repl_backreference, repl_text)

    return re.sub(pattern, repl, text)

Tests:

repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )

repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )

# output: not newData.exists() or newData.val().length == 1
#         (not this and not that) or not this and foo == bar

Caveats:

  • Only numerical backreferences are supported; no named references.
  • Silently accepts invalid backreferences like {r'(a)': r'\2'}. (These will sometimes throw an error, but not always.)



回答2:


Similar solution to Rawing, only precomputing the expensive stuff ahead of time by modifying the group indices in backreferences. Also, using unnamed groups.

Here we silently wrap each case in a capture group, and then update any replacements with backreferences to correctly identify the appropriate subgroup by absolute position. Note, that when using a replacer function, backreferences do not work by default (you need to call match.expand).

import re
from collections import OrderedDict
from functools import partial

pattern_to_replacement = {'&&': 'and', '!([a-zA-Z_]+)': r'not \1'}


def build_replacer(cases):
    ordered_cases = OrderedDict(cases.items())
    replacements = {}

    leading_groups = 0
    for pattern, replacement in ordered_cases.items():
        leading_groups += 1

        # leading_groups is now the absolute position of the root group (back-references should be relative to this)
        group_index = leading_groups
        replacement = absolute_backreference(replacement, group_index)
        replacements[group_index] = replacement

        # This pattern contains N subgroups (determine by compiling pattern)
        subgroups = re.compile(pattern).groups
        leading_groups += subgroups

    catch_all = "|".join("({})".format(p) for p in ordered_cases)
    pattern = re.compile(catch_all)

    def replacer(match):
        replacement_pattern = replacements[match.lastindex]
        return match.expand(replacement_pattern)

    return partial(pattern.sub, replacer)


def absolute_backreference(text, n):
    ref_pat = re.compile(r"\\([0-99])")

    def replacer(match):
        return "\\{}".format(int(match.group(1)) + n)

    return ref_pat.sub(replacer, text)


replacer = build_replacer(pattern_to_replacement)
print(replacer("!this.exists()"))



回答3:


Simple is better than complex, code as below is more readable(The reason why you code not work as expected is that ([a-zA-Z_]) should not be in re.escape):

repdict = {
    r'\s*' + re.escape('&&')) + r'\s*': ' and ',
    r'\s*' + re.escape('||') + r'\s*': ' or ',
    re.escape('!') + r'([a-zA-Z_])': r'not \1',
}
def replaceAll(repdict, text):
    for k, v in repdict.items():
        text = re.sub(k, v, text)
    return text


来源:https://stackoverflow.com/questions/45128959/python-replace-multiple-strings-while-supporting-backreferences

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!