Parsing large (20GB) text file with python - reading in 2 lines as 1

Submitted by 不羁的心 on 2019-12-02 21:59:41

A quick Google search for "python reading files larger than 4gb" yields many results. See here for one such example, and another one that takes over from the first.

It's a bug in Python.

Now, the explanation of the bug. It's not easy to reproduce, because it depends both on the internal FILE buffer size and on the number of chars passed to fread(). In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in the Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668

The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it fails because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?] At this point, the function thinks that the next read() will return the LF, but it won't, because the file pointer was not moved back.
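The arithmetic behind the failure can be sketched in a few lines of Python (the offset is hypothetical, chosen only to show where the high bits go):

```python
# Illustration: a file offset just past 4 GB does not fit in a 32-bit DWORD.
offset = 4 * 1024**3 + 100   # a file position ~4 GB + 100 bytes
dword = offset & 0xFFFFFFFF  # the low 32 bits that a DWORD can represent
print(offset)  # 4294967396
print(dword)   # 100 -- the high bit is lost, so the call cannot report the position
```

Because the true position cannot be squeezed into the return value, the seek-back is reported as failed and the file pointer stays where it was.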

And the work-around:

Note that Python 3.x is not affected (raw files are always opened in binary mode, and CRLF translation is done by Python itself); on 2.7, you can use io.open().
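A minimal sketch of the work-around: io.open() performs newline translation in Python rather than in the C runtime, so the CRT bug cannot bite. (Demonstrated here on a tiny temp file; the real trigger needs a file past the 4GB mark.)

```python
import io
import os
import tempfile

# Write a small file with CRLF line endings in binary mode.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(b'first\r\nsecond\r\n')

# io.open() in text mode translates CRLF to '\n' inside Python itself.
with io.open(path, 'r') as f:
    lines = f.readlines()

print(lines)  # ['first\n', 'second\n']
os.remove(path)
```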

The 4GB mark is suspiciously near the number of distinct values that can be stored in a 32-bit register (2**32).

The code you've posted looks fine by itself, so I would suspect a bug in your Python build.

FWIW, the snippet would be a little cleaner if it used enumerate:

lstIgnoredRows = []

with open(inputFileName, 'r') as inputFileHandle, \
     open(outputFileName, 'w') as outputFileHandle:
    for row, line in enumerate(inputFileHandle):
        if line_meets_condition:  # placeholder for the actual filter
            outputFileHandle.write(line)
        else:
            lstIgnoredRows.append(row)
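A runnable sketch of the enumerate pattern, with a hypothetical condition (keep non-blank lines) standing in for line_meets_condition:

```python
import os
import tempfile

# Create a small input file: two kept lines with a blank line between them.
fd, in_path = tempfile.mkstemp()
os.close(fd)
with open(in_path, 'w') as f:
    f.write('keep\n\nalso keep\n')

kept = []
ignored = []
with open(in_path) as src:
    for row, line in enumerate(src):
        if line.strip():          # hypothetical line_meets_condition
            kept.append(line)
        else:
            ignored.append(row)   # record 0-based index of skipped rows

print(kept)     # ['keep\n', 'also keep\n']
print(ignored)  # [1]
os.remove(in_path)
```

enumerate() yields the 0-based row index alongside each line, so there is no need to maintain a separate counter.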