Raw string and regular expression in Python

后悔当初 2020-12-03 08:45

I am confused about how raw strings behave in the following code:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(


        
4 Answers
  •  暖寄归人
    2020-12-03 09:35

    You have to distinguish between what the Python interpreter does to a string literal and what the re module does with the resulting string.

    In Python, a backslash followed by a character can denote a special character unless the string is raw. For instance, \n is a newline, \r is a carriage return, \t is a tab, and \b is a (nondestructive) backspace. \d, by contrast, is not a recognized escape sequence, so Python leaves the backslash and the d in the string as they are.
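
A minimal sketch of that difference, using only built-in string handling (the variable names are illustrative, not from the question). It compares a normal literal with a raw one to show that escape processing happens when Python reads the literal:

```python
# A normal literal: Python turns the two source characters \ and n
# into a single newline character.
newline = '\n'

# A raw literal: the backslash and the n are both kept.
raw_newline = r'\n'

print(len(newline))          # 1
print(len(raw_newline))      # 2

# Escaping the backslash by hand gives the same string as the raw form.
print(raw_newline == '\\n')  # True

# \b in a normal literal is the backspace character (code point 8).
print(ord('\b'))             # 8
```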

    The regex engine, however, gives its own meaning to many backslash sequences: some mean nothing to Python, and some already mean something else. That is the catch. \b is one such case: in a Python string it is a backspace, but in a regex it is a word boundary. If you write \b in a non-raw pattern string, Python replaces it with a backspace character before the string ever reaches the regex function, so the engine looks for a literal backspace instead of a word boundary. To pass the backslash and the b through intact, either escape the backslash (\\b) or use a raw string.
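
The \b case can be checked with a short sketch like the following. The sample text is made up for illustration; re.findall is the standard library function:

```python
import re

text = 'cat concat cats'

# Raw pattern: the regex engine sees \b and treats it as a word boundary.
word_boundary = re.findall(r'\bcat\b', text)
print(word_boundary)   # ['cat']

# Non-raw pattern: Python has already replaced \b with a backspace
# character, so the engine searches for backspace + 'cat' + backspace.
backspace = re.findall('\bcat\b', text)
print(backspace)       # []

# Escaping the backslash is equivalent to using a raw string.
escaped = re.findall('\\bcat\\b', text)
print(escaped)         # ['cat']
```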

    Back to your question about \d: it has no special meaning in Python, so it reaches the re module untouched. There, the regex engine, which is separate from the Python interpreter, interprets it as "any digit".
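
As a small illustration (a sketch with an arbitrary sample string), the pattern below is still two plain characters when it leaves Python, and only the regex engine gives it the "digit" meaning:

```python
import re

pattern = r'\d+'

# To Python, \d is just a backslash followed by the letter d.
print(len(r'\d'))        # 2
print(r'\d' == '\\d')    # True

# The regex engine is what interprets \d as "any digit".
digits = re.findall(pattern, 'Today is 11/27/2012.')
print(digits)            # ['11', '27', '2012']
```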


    Regarding the edit to the question:

    import re
    
    text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
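    # Raw pattern, raw replacement: \3, \1 and \2 reach re.sub as group references.
    text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
    # Non-raw pattern: \d is not a Python escape, so the pattern is the same as above.
    text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
    # Non-raw replacement: Python turns \3, \1 and \2 into chr(3), chr(1) and chr(2).
    text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)
    # Both non-raw: the pattern still works, the replacement is broken as in the third call.
    text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)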
    
    print(text2_re)
    print(text2_re1)
    print(text2_re2)
    print(text2_re3)
    

    The first two should be straightforward: re.sub matches the numbers and forward slashes and writes the groups back in a different order, separated by hyphens. Since \d has no special meaning in Python, it is passed on to re.sub unchanged whether or not the pattern is a raw string (although newer Python versions warn about the unrecognized escape, so the raw form is still preferable).

    The third and fourth behave differently because the replacement strings are not raw. In a Python string, \1, \2 and \3 are octal escapes for the characters with code points 1, 2 and 3. These are control characters; some consoles (for example those using code page 437) display them as a white (unfilled) smiley face, a black (filled) smiley face and a heart respectively, and where they cannot be displayed you get 'character boxes' instead. So rather than inserting the captured groups, re.sub inserts those specific characters.
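
One way to confirm this, and to see the two possible fixes side by side, is the sketch below. It reuses text2 and the pattern from the question; the other variable names are illustrative:

```python
import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern = r'(\d+)/(\d+)/(\d+)'

# In a non-raw literal, \3, \1 and \2 are octal escapes for control characters.
print('\3-\1-\2' == '\x03-\x01-\x02')   # True

# re.sub therefore receives no backslashes and inserts the characters literally.
broken = re.sub(pattern, '\3-\1-\2', text2)
print(repr(broken))

# Fix 1: a raw string keeps the backslashes, so re.sub sees group references.
fixed_raw = re.sub(pattern, r'\3-\1-\2', text2)

# Fix 2: escaping each backslash by hand produces the same replacement string.
fixed_escaped = re.sub(pattern, '\\3-\\1-\\2', text2)

print(fixed_raw)   # Today is 2012-11-27. PyCon starts 2013-3-13.
```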
