Why does my regex match with r'string' but not with 'string' in Python?

滥情空心 2020-12-10 15:25

The way regex works in Python is so intensely puzzling that it makes me more furious with each passing second. Here's my problem:

I understand that this gives a res

4 answers
  • 2020-12-10 15:54

    The solution is the one you used yourself in the example above: raw strings.

    regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)
    

    (Note that I also removed the extra brackets, turning the list comprehension into a generator expression.)
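
    As a quick check of the raw-string version, here is a minimal sketch. The states list and the sample text are made-up values for illustration; they are not taken from the question.

```python
import re

# Hypothetical sample data, assumed for illustration only.
states = ['mi', 'oh', 49505]
text = 'grand rapids, mi 49505'

# Raw strings keep \b as backslash + 'b', which re reads as a word boundary.
regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)

matches = re.findall(regex, text)
print(regex)    # \bmi\b|\boh\b|\b49505\b
print(matches)  # ['mi', '49505']
```

    With plain '\b' instead of r'\b', the same findall call would return an empty list for this text.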

  • 2020-12-10 15:58

    The answer itself

    regex = '|'.join([r'\b' + str(state) + r'\b' for state in states])
    

    The reason is that the 'r' prefix tells Python not to process escape sequences in the string literal. Without the 'r', Python turns a recognized sequence such as '\n' or '\t' into a special character, so that you can easily enter line breaks (\n), tabs (\t) and such.

    When you write '\b', you tell Python to create a string, process it, and transform '\b' into a backspace character. When you write r'\b', Python just stores '\' then 'b', and this is what you want for a regex. Always use 'r' for strings used as regex patterns.

    The 'r' notation is called 'raw string', but that's misleading, as there is no such thing as a raw string in Python internals. Just think about it as a way to tell Python to avoid being too smart.
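
    To see that the prefix only changes how the literal is read, and not the kind of object you get, here is a small sketch (the variable names are illustrative):

```python
raw = r'\b'      # two characters: backslash, 'b'
escaped = '\\b'  # the same two characters, written with an escaped backslash
cooked = '\b'    # one character: backspace (U+0008)

print(type(raw) is type(cooked))  # True: both are ordinary str objects
print(raw == escaped)             # True
print(len(raw), len(cooked))      # 2 1
```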

    There is another notation in Python < 3.0, u'string', that tells Python to store the string as unicode. You can combine both: ur"é\n" will store "é", then a backslash, then "n" as unicode, while u"é\n" will store "é" then a line break.

    Some ways to improve your code:

    regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)
    

    Removed the extra []. This turns the list comprehension into a generator expression, so you do not build a named list you never reuse. We can do it here because the values go directly into join() and nowhere else.

    regex = '|'.join(r'\b%s\b' % state for state in states)
    

    This takes care of the string conversion automatically and is shorter and cleaner. When you format strings in Python, think about the % operator.

    If states contains a list of state zip codes, they should be stored as strings, not as ints. In that case, you can skip the type casting and shorten it even more:

    regex = r'\b%s\b' % r'\b|\b'.join(states)
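
    A minimal sketch to confirm that the three constructions above produce the same pattern, assuming states holds strings (the values are hypothetical):

```python
# Hypothetical sample data: codes stored as strings, as suggested above.
states = ['49505', '48201', '10001']

a = '|'.join(r'\b' + str(state) + r'\b' for state in states)
b = '|'.join(r'\b%s\b' % state for state in states)
c = r'\b%s\b' % r'\b|\b'.join(states)

print(a == b == c)  # True
print(c)            # \b49505\b|\b48201\b|\b10001\b
```

    The last form only matches the others when states is non-empty: for an empty list it yields \b\b rather than an empty pattern.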
    

    In the end, you may not need regex at all. If all you care about is whether one of the zip codes appears in the given string, you can just use in (between two strings, in checks whether the first is a substring of the second):

    matches = [s for s in states if s in 'grand rapids, mi 49505']
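
    One caveat with the in approach is that it ignores word boundaries. The sketch below uses hypothetical values chosen to expose the difference; re.escape is added here as a precaution and is not part of the original answer.

```python
import re

# Hypothetical values chosen to expose the difference.
states = ['mi', 'id', '49505']
text = 'grand rapids, mi 49505'

# Substring test: 'id' is found inside the word 'rapids'.
substring_hits = [s for s in states if s in text]

# Word-boundary test: only whole words count.
word_hits = [s for s in states if re.search(r'\b%s\b' % re.escape(s), text)]

print(substring_hits)  # ['mi', 'id', '49505']
print(word_hits)       # ['mi', '49505']
```

    For zip codes alone the substring test is usually good enough; for two-letter state codes the regex version is the safer choice.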
    

    Last word

    I understand you may be frustrated when learning a new language, but take the time to give your question a proper title. On this website, the title should end with a question mark and give specific details about the problem.

  • 2020-12-10 16:01

    Let's break these two strings down:

    r'\bmi\b'
    

    Python interprets the above string as six characters long (backslash, letter B, etc.). A raw string suppresses Python's translation of \b into a backspace.

    re interprets the two characters \ and b as a word break.

    '\bmi\b'
    

    Python interprets the above string as four characters long (backspace, letter M, etc.).
    re now sees nothing special to interpret and looks for those four literal characters.

    So the construction below is looking for backspaces, not word breaks:

    regex = '|'.join(['\b' + str(state) + '\b' for state in states])
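
    A small sketch of that behaviour, using a made-up sample text: the non-raw pattern only matches when the text itself contains backspace characters.

```python
import re

cooked = '\bmi\b'  # backspace, 'm', 'i', backspace
raw = r'\bmi\b'    # backslash, 'b', 'm', 'i', backslash, 'b'

text = 'grand rapids, mi 49505'  # hypothetical sample text

print(len(cooked), len(raw))                      # 4 6
print(re.search(cooked, text))                    # None: no backspaces in text
print(re.search(raw, text).group(0))              # mi
print(re.search(cooked, 'x\bmi\by') is not None)  # True: literal backspaces
```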
    

    Try this (dropping str, state should already be a string):

    regex = '|'.join([r'\b' + state + r'\b' for state in states])
    

    The word break doesn't need to be processed in every OR expression. Pulling it out simplifies the join:

    regex = r'\b(' + '|'.join(states) + r')\b'
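
    The parentheses matter here. As a sketch with hypothetical data: without the group, | splits the whole pattern, so each \b binds to only one alternative.

```python
import re

states = ['mi', 'oh']  # hypothetical sample data

grouped = r'\b(' + '|'.join(states) + r')\b'
ungrouped = r'\b' + '|'.join(states) + r'\b'  # parses as (\bmi)|(oh\b)

text = 'mice say oh'
print(re.findall(grouped, text))    # ['oh']
print(re.findall(ungrouped, text))  # ['mi', 'oh']
```

    The ungrouped pattern wrongly picks up the 'mi' at the start of 'mice', because only the first alternative is anchored on the left and only the last one on the right.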
    

    Since Pythonistas usually frown on regexes, might as well make a readable one:

    import re
    
    # Flags are passed to re.compile(): an inline (?ix) that is not at the very
    # start of the pattern is rejected by Python 3.11 and later.
    pattern = re.compile(r'''
        \b    # word break
        (     # begin group 1
        AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|
        HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|
        MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|
        NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|
        SD|TN|TX|UT|VT|VA|WA|WV|WI|WY
        )     # end group 1
        \b    # word break
        ''', re.IGNORECASE | re.VERBOSE)
    
    m = pattern.search('Grand Rapids, MI 49505')
    if m:
        print(m.group(1))
    
  • 2020-12-10 16:02

    The key is understanding the difference between '\b' and r'\b'. Typing these in IDLE results in this output:

    >>> '\b'
    '\x08'
    >>> r'\b'
    '\\b'
    

    So whenever a regex contains a backslash, write the pattern in raw string notation so that the backslash reaches re unchanged.
