$ Windows newline symbol in Python bytes regex

让人想犯罪 __ submitted on 2021-02-02 08:56:37

Question


$ matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

However, a Windows newline consists of two characters, '\r\n'. How can I make '$' treat '\r\n' as a newline in a bytes pattern?

Here is what I have:

# Python 3.4.2
import re

input = b'''
//today is a good day \r\n
//this is Windows newline style \r\n
//unix line style \n
...other binary data... 
'''

L = re.findall(rb'//.*?$', input, flags = re.DOTALL | re.MULTILINE)
for item in L : print(item)

The current output is:

b'//today is a good day \r'
b'//this is Windows newline style \r'
b'//unix line style '

But the expected output is:

b'//today is a good day '
b'//this is Windows newline style '
b'//unix line style '

Answer 1:


It is not possible to redefine anchor behavior.

To match a // with any number of characters other than CR and LF after it, use a negated character class [^\r\n] with * quantifier:

L = re.findall(rb'//[^\r\n]*', input)

Note that this approach does not require using re.M and re.S flags.
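
As a minimal sketch of the negated character class approach, the snippet below runs it on sample data modeled on the question. The variable names `data` and `comments` are illustrative and not part of the original post.

```python
import re

# Sample data modeled on the question; `data` is an illustrative name.
data = (
    b"//today is a good day \r\n"
    b"//this is Windows newline style \r\n"
    b"//unix line style \n"
    b"...other binary data... "
)

# [^\r\n]* stops at the first CR or LF, so no flags are needed.
comments = re.findall(rb"//[^\r\n]*", data)
for item in comments:
    print(item)
```

Because the class excludes both CR and LF, the same pattern handles Windows and Unix line endings without any flags.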

Or, you can add \r? before $ and wrap that part in a positive lookahead (you will also need the lazy *? quantifier with ., and re.M so that $ matches at line ends):

rb'//.*?(?=\r?$)'

The lookahead works because $ is itself a zero-width assertion: it does not consume the \n character. Wrapping it in a lookahead together with an optional \r therefore keeps the \r out of the match.
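
A small sketch of the lookahead variant, assuming a short illustrative input (`data`, `pattern` and `matches` are names chosen here, not from the original post):

```python
import re

# Illustrative input: one CRLF line, one LF line, one unterminated last line.
data = b"//dos line \r\n//unix line \n//last line"

# re.MULTILINE lets $ match before each \n; the optional \r inside the
# lookahead is checked but never consumed, so it stays out of the match.
pattern = re.compile(rb"//.*?(?=\r?$)", re.MULTILINE)
matches = pattern.findall(data)
print(matches)
```

One consequence of this design: a \r that is not directly followed by \n (or the end of the input) is still matched by `.`, so it stays inside the result.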

Maybe this is not that pertinent since it is from MSDN, but I think it is the same for Python:

Note that $ matches \n but does not match \r\n (the combination of carriage return and newline characters, or CR/LF). To match the CR/LF character combination, include \r?$ in the regular expression pattern.

In PCRE, you can use (*ANYCRLF), (*CR) and (*ANY) to override the default behavior of the $ anchor, but not in Python.




Answer 2:


A hack, but...

re.findall(rb'//.*?(?=\r|\n|(?!.))', input, re.DOTALL | re.MULTILINE)

This mimics the default $ anchor but also stops before \r: the lookahead succeeds just before \r, \n, or the end of the string (with re.DOTALL, (?!.) can only succeed at the very end of the input).
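
A hedged sketch of this approach on bytes input follows. The pattern must itself be a bytes pattern (rb'...'), since Python raises a TypeError when a str pattern is applied to bytes. The sample `data` is illustrative.

```python
import re

# Illustrative input, including a bare \r in the middle of a line.
data = b"//dos line \r\n//bare CR \rtrailing\n//last line"

# No anchors are used, so re.MULTILINE is not needed here; re.DOTALL makes
# (?!.) succeed only at the very end of the input.
pattern = re.compile(rb"//.*?(?=\r|\n|(?!.))", re.DOTALL)
matches = pattern.findall(data)
print(matches)
```

Unlike a lookahead built on \r?$, this version ends the match at any \r, even one that is not followed by \n.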




Answer 3:


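I think you could also use \v, which in PCRE matches any vertical whitespace character: [\n\cK\f\r\x85\x{2028}\x{2029}]. Note that Python's re module treats \v as the vertical tab (\x0b) only, so this applies to PCRE-style engines rather than to re.

To keep it out of the output, use a lookahead with a lazy quantifier so that the match stops before the \r: //.*?(?=\v|$)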

Test at regex101.com
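
Python's re has no vertical-whitespace shorthand, so a rough equivalent for bytes has to spell the class out. The sketch below is an assumption-laden approximation: the name `VERTICAL_WS` and the choice of bytes in it are made up for illustration, and 0x85 is included only as the single-byte NEL, which may be unwanted when the input is arbitrary binary or UTF-8 data.

```python
import re

# Assumed byte-level stand-in for PCRE's \v: LF, VT, FF, CR and NEL (0x85).
# U+2028 and U+2029 have no single-byte form, so they are left out.
VERTICAL_WS = rb"[\n\x0b\x0c\r\x85]"

data = b"//crlf \r\n//vtab \x0b//formfeed \x0c//last"

# \Z matches only at the very end of the input.
pattern = re.compile(rb"//.*?(?=" + VERTICAL_WS + rb"|\Z)")
matches = pattern.findall(data)
print(matches)

# In Python's re, \v by itself is just the vertical tab.
print(re.fullmatch(rb"\v", b"\x0b") is not None)
print(re.fullmatch(rb"\v", b"\n") is not None)
```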



Source: https://stackoverflow.com/questions/31399999/windows-newline-symbol-in-python-bytes-regex
