How to match a new line character in Python raw string

無奈伤痛 2020-12-13 13:08

I got a little confused about Python raw string. I know that if we use raw string, then it will treat '\' as a normal backslash (ex. r'\n' wo

4 answers
  • 2020-12-13 13:39

    In a regular expression, you need to specify that you're in multiline mode:

    >>> import re
    >>> s = """cat
    ... dog"""
    >>> 
    >>> re.match(r'cat\ndog',s,re.M)
    <_sre.SRE_Match object at 0xcb7c8>
    

    Notice that re translates the \n in the raw string into a newline. As you indicated in your comments, you don't actually need re.M for this pattern to match, but it does make $ and ^ behave more intuitively:

    >>> re.match(r'^cat\ndog',s).group(0)
    'cat\ndog'
    >>> re.match(r'^cat$\ndog',s).group(0)  #doesn't match
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'NoneType' object has no attribute 'group'
    >>> re.match(r'^cat$\ndog',s,re.M).group(0) #matches.
    'cat\ndog'
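
    As a rough sketch of the point above (the variable names are illustrative, not from the original answer): a raw string keeps \n as two characters, but re still reads that escape as a newline, and re.M only changes where ^ and $ may match.

```python
import re

s = "cat\ndog"

# A raw string keeps the backslash: two characters, '\' and 'n'.
# A normal string turns \n into a single newline character.
raw_len, plain_len = len(r"\n"), len("\n")
print(raw_len, plain_len)  # 2 1

# Both patterns match: re interprets the two-character escape \n
# as a newline, and a literal newline in a pattern matches itself.
raw_match = re.match(r"cat\ndog", s)
plain_match = re.match("cat\ndog", s)

# Without re.M, $ matches only at the end of the string (or just
# before a trailing newline), so this attempt returns None.
no_flag = re.match(r"^cat$\ndog", s)

# With re.M, $ also matches just before each newline.
with_flag = re.match(r"^cat$\ndog", s, re.M)
print(no_flag, with_flag.group(0) == s)
```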
    
  • 2020-12-13 13:39

    The simplest answer is to simply not use a raw string. You can escape backslashes by using \\.

    If you have huge numbers of backslashes in some segments, then you could concatenate raw strings and normal strings as needed:

    r"some string \ with \ backslashes" "\n"
    

    (Python automatically concatenates string literals with only whitespace between them.)
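
    A quick check of that concatenation, as a minimal sketch (the names combined and escaped are only for illustration): the raw part keeps its backslashes, the normal part contributes a real newline, and the result equals the fully escaped spelling.

```python
# Adjacent literals are joined into one string, so the raw part keeps
# its backslashes while "\n" becomes a real newline character.
combined = r"some string \ with \ backslashes" "\n"

# The same text written as one normal string with escaped backslashes.
escaped = "some string \\ with \\ backslashes\n"

print(combined == escaped)  # True
print(repr(combined[-1]))   # '\n'
```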

    Remember if you are working with paths on Windows, the easiest option is to just use forward slashes - it will still work fine.

  • 2020-12-13 13:41
    def clean_with_puncutation(text):
        """Replace punctuation, newlines and marker tags with named tokens."""
        from string import punctuation
        import re

        # Map each punctuation character to a token such as <PUNC_!>.
        punctuation_token = {p: '<PUNC_' + p + '>' for p in punctuation}
        punctuation_token['<br/>'] = "<TOKEN_BL>"
        punctuation_token['\n'] = "<TOKEN_NL>"
        punctuation_token['<EOF>'] = '<TOKEN_EOF>'
        punctuation_token['<SOF>'] = '<TOKEN_SOF>'

        # Always put the multi-character tokens at the front to avoid
        # overlapping results. The \n inside the raw string is passed to
        # re unchanged, and re reads it as a newline. The pattern is split
        # into two adjacent raw literals so that no continuation
        # whitespace ends up inside the character class.
        regex = (r"(<br/>)|(<EOF>)|(<SOF>)|[\n\!\@\#\$\%\^\&\*\(\)\[\]"
                 r"\{\}\;\:\,\.\/\?\|\`\_\\+\\\=\~\-\<\>]")

        # Example input:
        # text = '<EOF>!@#$%^&*()[]{};:,./<>?\|`~-= _+\<br/>\n <SOF>\ '
        text_ = ""
        index = 0

        for match in re.finditer(regex, text):
            # Copy the text before the match, then the token for the match.
            text_ = (text_ + text[index:match.start()] + " "
                     + punctuation_token[match.group()] + " ")
            index = match.end()

        # Keep whatever follows the last match.
        return text_ + text[index:]
    
  • 2020-12-13 13:43

    You can also use [\r\n] to match a newline; the character class matches either a carriage return or a line feed.
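
    One way this could look in practice, as a sketch with made-up sample text: splitting on line endings while treating the Windows pair \r\n as a single break.

```python
import re

# Sample text mixing Unix (\n), Windows (\r\n) and lone \r endings.
text = "unix\nwindows\r\nold mac\rend"

# \r\n is listed first so a Windows line ending counts as one break;
# [\r\n] then covers a lone line feed or carriage return.
lines = re.split(r"\r\n|[\r\n]", text)
print(lines)  # ['unix', 'windows', 'old mac', 'end']

# str.splitlines() handles the same endings without a regex.
print(text.splitlines() == lines)  # True
```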
