Question
$ matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.
However, the Windows newline sequence consists of two characters, '\r\n'. How can I make '$' recognize '\r\n' as a newline in bytes?
Here is what I have:
# Python 3.4.2
import re
input = b'''
//today is a good day \r\n
//this is Windows newline style \r\n
//unix line style \n
...other binary data...
'''
L = re.findall(rb'//.*?$', input, flags = re.DOTALL | re.MULTILINE)
for item in L : print(item)
now the output is:
b'//today is a good day \r'
b'//this is Windows newline style \r'
b'//unix line style '
but the expected output is:
b'//today is a good day '
b'//this is Windows newline style '
b'//unix line style '
Answer 1:
It is not possible to redefine anchor behavior.
To match a // with any number of characters other than CR and LF after it, use a negated character class [^\r\n] with * quantifier:
L = re.findall(rb'//[^\r\n]*', input)
Note that this approach does not require using re.M and re.S flags.
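As a quick check of the negated character class, here is a minimal sketch. The `data` bytes are a shortened stand-in for the question's input (an assumption, not the asker's exact data), and the variable names are illustrative only.

```python
import re

# Stand-in for the question's input: two CRLF-terminated lines and one LF-terminated line.
data = b'//today is a good day \r\n//this is Windows newline style \r\n//unix line style \n'

# [^\r\n]* consumes everything up to, but not including, the first CR or LF,
# so no re.M or re.S flag is needed.
comments = re.findall(rb'//[^\r\n]*', data)

for item in comments:
    print(item)
```

Because the class excludes both CR and LF, the trailing '\r' never ends up in a match, whichever newline style a line uses.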
Or, you can add \r? before the $ and enclose this part in a positive look-ahead (you will also need the lazy *? quantifier with .):
rb'//.*?(?=\r?$)'
The point in using a lookahead is that $ itself is a kind of a lookahead since it does not really consume the \n character. Thus, we can safely put it into a look-ahead with optional \r.
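The look-ahead variant can be tried the same way. This is a sketch on made-up sample bytes (`sample` is hypothetical); it uses only re.MULTILINE, which is enough here because the lazy `.*?` never has to cross a newline.

```python
import re

# Hypothetical sample: a CRLF line, an LF line, and a final line with no newline at all.
sample = b'//crlf line \r\n//lf line \n//last line'

# (?=\r?$) asserts that an optional CR followed by end-of-line comes next,
# without consuming either, so the CR stays out of the match.
matches = re.findall(rb'//.*?(?=\r?$)', sample, flags=re.MULTILINE)

for item in matches:
    print(item)
```

The last line is matched too, since $ also succeeds at the very end of the data.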
Maybe this is not that pertinent since it is from MSDN, but I think it is the same for Python:
Note that $ matches \n but does not match \r\n (the combination of carriage return and newline characters, or CR/LF). To match the CR/LF character combination, include \r?$ in the regular expression pattern.
In PCRE, you can use (*ANYCRLF), (*CR) and (*ANY) to override the default behavior of the $ anchor, but not in Python.
Answer 2:
A hack, but...
re.findall(rb'//.*?(?=\r|\n|(?!.))', input, re.DOTALL | re.MULTILINE)
This should replicate the behaviour of the default $ anchor (just before \r, \n or end of string).
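One difference from the \r?$ look-ahead may be worth knowing: this pattern stops at any CR, even one that is not followed by LF. The sketch below compares the two on hypothetical sample bytes (`sample`, `hack`, and `anchor` are illustrative names); note the `rb` prefix, which is required when searching bytes.

```python
import re

# Hypothetical sample: a CRLF line, a line containing a bare CR, and an unterminated last line.
sample = b'//crlf \r\n//bare\rcr\n//end'

# Stop before CR, before LF, or at the end of the data.
# With re.DOTALL, (?!.) can only succeed at the very end.
hack = re.findall(rb'//.*?(?=\r|\n|(?!.))', sample, re.DOTALL | re.MULTILINE)

# Stop only before an optional CR that is immediately followed by end-of-line.
anchor = re.findall(rb'//.*?(?=\r?$)', sample, re.MULTILINE)

print(hack)    # the bare CR ends the second match early
print(anchor)  # the bare CR is kept inside the second match
```

Which behaviour is preferable depends on whether a lone CR should count as a line break in the data at hand.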
Answer 3:
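I think you could also use \v (vertical whitespace), which in PCRE matches [\n\cK\f\r\x85\x{2028}\x{2029}]. Note that Python's re module treats \v as the vertical tab character only, so this applies to PCRE-style engines.
To keep it out of the output, use a lookahead: //.*(?=\v|$)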
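For Python specifically, a rough equivalent would be to spell the class out by hand. The sketch below assumes only the ASCII vertical-whitespace bytes matter (LF, VT, FF, CR); \x85 and the Unicode separators are left out because in raw binary data those byte values may mean something else. The names `sample`, `vt_only`, and `explicit` are illustrative.

```python
import re

# Hypothetical sample with mixed line endings.
sample = b'//crlf \r\n//lf \n//end'

# In Python's re, \v is just the vertical tab (0x0b): it does not match CR or LF.
vt_only = re.search(rb'\v', b'\r\n') is None and re.search(rb'\v', b'\x0b') is not None

# Explicit class of ASCII vertical whitespace, used inside a lookahead
# so the line break itself is not part of the match.
explicit = re.findall(rb'//.*?(?=[\n\x0b\f\r]|$)', sample)

print(vt_only)
print(explicit)
```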
Test at regex101.com
Source: https://stackoverflow.com/questions/31399999/windows-newline-symbol-in-python-bytes-regex