Python restrict newline characters for readlines()

可紊 提交于 2019-12-03 09:37:28

file.readlines() will only ever split on \n, \r or \r\n depending on the OS and if universal newline support is enabled.

U+0085 NEXT LINE (NEL) is not recognised as a newline splitter in that context, and you don't need to do anything special to have file.readlines() ignore it.

Quoting the open() function documentation:

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

and the universal newlines glossary entry:

A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.

Unfortunately, codecs.open() breaks with this rule; the documentation vaguely alludes to the specific codec being asked:

Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.

Instead of codecs.open(), use io.open() to open the file in the correct encoding, then process the lines one by one:

with io.open(filename, encoding=correct_encoding) as f:
    lines = f.open()

io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles just \n, \r and \r\n:

>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']

The codecs.open() result is due to the code using str.splitlines() being used, which has a documentation bug; when splitting a unicode string, it'll split on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method is falling short of explaining this; it claims to only split according to the Universal Newline rules.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!