regex.sub() gives different results to re.sub()

栀梦 2021-01-04 05:07

I work with Czech accented text in Python 3.4.

Calling re.sub() to perform substitution by regex on an accented sentence works well, but using a regex compiled with

2 answers
  • 2021-01-04 05:28

    The last argument to re.compile is flags. If you pass flags=flags as a keyword argument to re.sub, you will see the same behaviour from both:

    compiled = re.compile(pattern, flags)
    print(compiled)
    text = 'Poplatníkem daně z pozemků je vlastník pozemku'
    mark = r'**\1**' # wrap 1st matching group in double stars
    
    r = re.sub(pattern, mark, text, flags=flags)
    

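    The fourth positional argument to re.sub is count, not flags. Passing the flags value in that position sets the maximum number of replacements instead of enabling any flag, which is why you see the difference. Compare the two signatures: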

    re.sub(pattern, repl, string, count=0, flags=0)

    re.compile(pattern, flags=0)
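To make the difference concrete, here is a minimal sketch. The pattern, the sample text and the choice of re.IGNORECASE are assumptions made for illustration; they are not the values from the question. re.IGNORECASE has the integer value 2, so passing it as the fourth positional argument is read as count=2 with no flags set.

```python
import re
import warnings

# Hypothetical inputs for illustration only (not the question's own pattern).
pattern = r'(pozem\w+)'
text = 'Pozemků pozemku pozemek pozemky'
mark = r'**\1**'          # wrap the first matching group in double stars
flags = re.IGNORECASE     # integer value 2

# The fourth positional argument of re.sub is count, so this call means
# count=2 with no flags. Newer Python versions may emit a DeprecationWarning
# for a positional count, which is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
    by_position = re.sub(pattern, mark, text, flags)

# Passing flags by keyword matches what the compiled pattern does.
by_keyword = re.sub(pattern, mark, text, flags=flags)
by_compiled = re.compile(pattern, flags).sub(mark, text)

print(by_position)   # case-sensitive, at most two replacements
print(by_keyword)    # case-insensitive, all matches replaced
print(by_compiled)   # same as by_keyword
```

With these inputs the positional call skips the capitalised word and stops after two replacements, while the keyword and compiled versions agree with each other.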

  • 2021-01-04 05:37

    As Padraic Cunningham figured out, this is not actually a bug.

    However, it is related to a bug which you didn't run into, and to you using a flag you probably shouldn't be using, so I'll leave my earlier answer below, even though his is the right answer to your problem.


    There's a recent-ish change (somewhere between 3.4.1 and 3.4.3, and between 2.7.3 and 2.7.8) that affects this. Before that change, you couldn't even compile that pattern without raising an OverflowError.

    More importantly, why are you using re.L? The re.L mechanism does not mean "use the Unicode rules for my locale", it means "use some unspecified non-Unicode rules that only really make sense for Latin-1-derived locales and may not work right on Windows". Or, as the docs put it:

    Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one “culture” at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.

    See bug #22407 and the linked python-dev thread for some recent discussion of this.

    And if I remove the re.L flag, the code compiles just fine on 3.4.1. (I also get the "right" results on both 3.4.1 and 3.4.3, but that's just a coincidence: the first version now intentionally omits the screwy flag, and the second still accidentally fails to pass it, so they match…)
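As a small sketch of why re.L is unnecessary here: with a str pattern in Python 3, \w already follows Unicode rules, so Czech accented letters count as word characters without any flag. The sample sentence is the one from the other answer; the re.ASCII comparison is added only for contrast.

```python
import re

text = 'Poplatníkem daně z pozemků je vlastník pozemku'

# Default for str patterns: Unicode matching, accented letters are \w.
unicode_words = re.findall(r'\w+', text)

# For contrast: re.ASCII restricts \w to [a-zA-Z0-9_], so words break
# at every accented letter.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The first list contains the seven whole words of the sentence; the second contains fragments such as 'Poplatn' and 'kem'.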

    So, even if this were a bug, there's a good chance it would be closed WONTFIX. The resolution for #22407 was to deprecate re.L for non-bytes patterns in 3.5 and remove it in 3.6, so I doubt anyone's going to care about fixing bugs with it now. (Not to mention that re itself is theoretically going away in favor of regex one of these decades… and IIRC, regex also deprecated the L flag unless you're using a bytes pattern and re-compatible mode.)
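The removal mentioned above can be checked with a short sketch. This assumes Python 3.6 or later, where, as far as I know, compiling a str pattern with re.LOCALE raises ValueError while a bytes pattern is still accepted. The variable names are hypothetical.

```python
import re

# Assumption: Python 3.6+ rejects re.LOCALE combined with a str pattern.
try:
    re.compile(r'\w+', re.LOCALE)
    str_pattern_rejected = False
except ValueError:
    str_pattern_rejected = True

# A bytes pattern may still be compiled with re.LOCALE.
bytes_pattern = re.compile(rb'\w+', re.LOCALE)

print(str_pattern_rejected)
print(bytes_pattern.findall(b'abc def'))
```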
