Question
The handling of zero-length matches changed in Python 3.7. Consider the following with Python 3.6 (and earlier):
>>> import re
>>> print(re.sub('a*', 'x', 'bac'))
xbxcx
>>> print(re.sub('.*', 'x', 'bac'))
x
With Python 3.7 we get the following:
>>> import re
>>> print(re.sub('a*', 'x', 'bac'))
xbxxcx
>>> print(re.sub('.*', 'x', 'bac'))
xx
I understand this is the standard behavior of PCRE. Furthermore, re.finditer() seems to have always detected the additional match:
>>> for m in re.finditer('a*', 'bac'):
...     print(m.start(0), m.end(0), m.group(0))
...
0 0
1 2 a
2 2
3 3
That said, I'm interested in recovering the Python 3.6 behavior (this is for a hobby project implementing sed in Python).
I came up with the following solution:
def sub36(regex, replacement, string):
    compiled = re.compile(regex)

    class Match(object):
        def __init__(self):
            self.prevmatch = None

        def __call__(self, match):
            try:
                if match.group(0) == '' and self.prevmatch and match.start(0) == self.prevmatch.end(0):
                    return ''
                else:
                    return re._expand(compiled, match, replacement)
            finally:
                self.prevmatch = match

    return compiled.sub(Match(), string)
which gives:
>>> print(re.sub('a*', 'x', 'bac'))
xbxxcx
>>> print(sub36('a*', 'x', 'bac'))
xbxcx
>>> print(re.sub('.*', 'x', 'bac'))
xx
>>> print(sub36('.*', 'x', 'bac'))
x
However, this seems tailored to these particular examples.
What would be the right way to reproduce the Python 3.6 behavior of re.sub() for zero-length matches in Python 3.7?
Answer 1:
Your solution may be the third-party regex package:
Regex Egg Introduction
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality. The re module’s behaviour with zero-width matches changed in Python 3.7, and this module will follow that behaviour when compiled for Python 3.7.
Installation:
pip install regex
Usage:
With regex, you can specify the version (V0, V1) that the pattern will be compiled with, i.e.:
# Python 3.7 and later
import regex
>>> regex.sub('.*', 'x', 'test')
'xx'
>>> regex.sub('.*?', '|', 'test')
'|||||||||'
# Python 3.6 and earlier
import regex
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'
Note:
The version can be indicated by the VERSION0 or V0 flag, or by (?V0) in the pattern.
Sources:
Regex thread - issue2636
regex 2018.11.22
Answer 2:
According to the 3.7 What's New, "The previous behavior can be restored by changing the pattern to r'.+'."
See https://docs.python.org/3/whatsnew/3.7.html under "Changes in the Python API". The solution would therefore seem to be to modify such a regex; there doesn't seem to be a flag you can pass to re to request the old behavior.
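As a quick illustration of that suggestion, the snippet below compares '.*' and '.+' on the question's input under Python 3.7+. It is only a sketch: the variable names are made up for the example, and the empty-string case is included because '.+' cannot match there, so the two patterns are not interchangeable for every input.

```python
import re

# On Python 3.7+, '.*' matches 'bac' and then the empty string at the end.
star_result = re.sub('.*', 'x', 'bac')

# '.+' needs at least one character, so the trailing empty match is gone.
plus_result = re.sub('.+', 'x', 'bac')

# Caveat: on empty input '.*' still matches once, while '.+' matches nothing.
star_on_empty = re.sub('.*', 'x', '')
plus_on_empty = re.sub('.+', 'x', '')

print(star_result, plus_result, repr(star_on_empty), repr(plus_on_empty))
```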
Answer 3:
A pattern for PCRE-style engines (including Python 3.7+) that satisfies the original a* example would be:
^a*|a+|(?<!a)$
https://regex101.com/r/zTpV1t/3
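A small check of this pattern with Python's re module (3.7 or later assumed) might look like the following. The constant and variable names are illustrative only.

```python
import re

# Rewritten pattern from this answer: leading run of a's (possibly empty),
# any non-empty run of a's, or end of string when not preceded by 'a'.
REWRITTEN = r'^a*|a+|(?<!a)$'

result_bac = re.sub(REWRITTEN, 'x', 'bac')
result_bbaacc = re.sub(REWRITTEN, 'x', 'bbaacc')

print(result_bac)
print(result_bbaacc)
```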
However, bbaacc would get substituted to xbbxccx (instead of xbxbxcxcx, which a* produced in Python 3.6 and earlier). It might still be good enough for some people.
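If that difference matters, another option is to re-create the old substitution loop by hand instead of rewriting the pattern. The sketch below assumes that the pre-3.7 loop (1) ignored an empty match located exactly where the previous match ended and (2) advanced by one character after any empty match. sub_pre37 is a hypothetical helper name, not part of re, and it has only been checked against the examples in this thread.

```python
import re


def sub_pre37(pattern, repl, string, count=0):
    """Sketch of re.sub() with the assumed pre-3.7 empty-match rules."""
    compiled = re.compile(pattern)
    pieces = []
    copied_upto = 0  # end of the last match that was actually replaced
    pos = 0          # where the next search starts
    n = 0            # number of replacements made so far
    while (count == 0 or n < count) and pos <= len(string):
        m = compiled.search(string, pos)
        if m is None:
            break
        start, end = m.span()
        # Assumed old rule: ignore an empty match at the latest match position.
        ignored = (start == end == copied_upto and n > 0)
        if not ignored:
            pieces.append(string[copied_upto:start])
            pieces.append(repl(m) if callable(repl) else m.expand(repl))
            copied_upto = end
            n += 1
        # Assumed old rule: step over one character after an empty match.
        pos = end + 1 if start == end else end
    pieces.append(string[copied_upto:])
    return ''.join(pieces)


print(sub_pre37('a*', 'x', 'bac'))
print(sub_pre37('.*', 'x', 'bac'))
print(sub_pre37('a*', 'x', 'bbaacc'))
```

Using search() with an explicit start position keeps anchors and lookbehinds working against the full string, which slicing the input would not.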
Source: https://stackoverflow.com/questions/53642571/retrieving-python-3-6-handling-of-re-sub-with-zero-length-matches-in-python-3