Speed up reading in a compressed bz2 file ('rb' mode)

Submitted by 若如初见 on 2021-02-11 14:21:37

Question


I have a BZ2 file of more than 10GB. I'd like to read it without decompressing it into a temporary file (it would be more than 50GB).

With this method:

import bz2, time
t0 = time.time()
time.sleep(0.001)  # avoid division by zero on the first print
with bz2.open(r"F:\test.bz2", 'rb') as f:   # raw string: "\t" would otherwise become a tab
    for i, l in enumerate(f):
        if i % 100000 == 0:
            print('%i lines/sec' % (i / (time.time() - t0)))

I can only read ~250k lines per second. On a similar file, decompressed first, I get ~3M lines per second, i.e. a 10x factor:

with open("F:\test.txt", 'rb') as f:

I think this is not only due to the intrinsic CPU cost of decompression (the total time to decompress into a temp file plus read that uncompressed file is much smaller than the method described here), but perhaps to a lack of buffering or other reasons. Are there faster Python implementations of bz2.open?

How can I speed up reading a BZ2 file in binary mode and loop over its "lines" (separated by \n)?

Note: currently, the time to decompress test.bz2 into test.tmp plus the time to iterate over the lines of test.tmp is far smaller than the time to iterate over the lines of bz2.open('test.bz2'), and this probably should not be the case.

Linked topic: https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869


Answer 1:


You can use BZ2Decompressor to deal with huge files. It decompresses blocks of data incrementally, out of the box:

import bz2, time

t0 = time.time()
time.sleep(0.000001)  # avoid division by zero on the first print
total_lines = 0
with open('temp.bz2', 'rb') as fi:
    decomp = bz2.BZ2Decompressor()
    residue = b''
    for data in iter(lambda: fi.read(100 * 1024), b''):
        # prepend the residual (incomplete last line) of the previous block
        # to the beginning of the current decompressed block
        raw = residue + decomp.decompress(data)
        current_block = raw.split(b'\n')
        # the last element is either an incomplete line or b'' (if raw ended
        # with \n); keep it as residue instead of counting it now
        residue = current_block.pop()
        # process_data(current_block) => do the processing of the current data block
        total_lines += len(current_block)
        print('%i lines/sec' % (total_lines / (time.time() - t0)))
    if residue:
        # process_data([residue]) => now finish processing the last (unterminated) line
        total_lines += 1
    print('Final: %i lines/sec' % (total_lines / (time.time() - t0)))

Here I read a chunk of the binary file, feed it into the decompressor and receive a chunk of decompressed data. Be aware that the decompressed chunks have to be concatenated to restore the original data; this is why the last entry needs special treatment.

In my experiments it runs a little faster than your solution with io.BytesIO(). bz2 is known to be slow, so if it bothers you, consider migrating to snappy or zstandard.
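
For reference, a minimal sketch of line-by-line reading from a zstandard-compressed file, assuming the third-party zstandard package is installed and using a hypothetical test.zst file (not part of the original answer):

import io
import zstandard as zstd

def zst_lines(path):
    # stream_reader decompresses incrementally; BufferedReader adds cheap line iteration
    dctx = zstd.ZstdDecompressor()
    with open(path, 'rb') as fh, dctx.stream_reader(fh) as reader:
        for line in io.BufferedReader(reader):
            yield line

for i, line in enumerate(zst_lines('test.zst')):   # hypothetical file name
    if i % 100000 == 0:
        print(i)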

Regarding the time it takes to process a bz2 file in Python: it might be fastest to decompress the file into a temporary one using a command-line utility and then process the resulting plain text file. Otherwise you are dependent on Python's implementation of bz2.
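
A variation on that idea, sketched below under the assumption that the bzip2 command-line tool is available on the PATH, pipes the decompressor's output directly into Python instead of writing a temporary file:

import subprocess

def bzcat_lines(path):
    # 'bzip2 -dc' decompresses to stdout; we stream its output line by line
    proc = subprocess.Popen(['bzip2', '-dc', path],
                            stdout=subprocess.PIPE,
                            bufsize=1024 * 1024)
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()

for i, line in enumerate(bzcat_lines('test.bz2')):
    if i % 100000 == 0:
        print(i)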




Answer 2:


This method already gives a 2x improvement over native bz2.open.

import bz2, time, io

def chunked_readlines(f):
    s = io.BytesIO()
    while True:
        buf = f.read(1024 * 1024)
        if not buf:
            # yield whatever is left in the buffer (the final line, possibly
            # without a trailing \n), then stop
            last = s.getvalue()
            if last:
                yield last
            return
        s.write(buf)
        s.seek(0)
        L = s.readlines()
        yield from L[:-1]
        s = io.BytesIO()
        s.write(L[-1])  # very important: the last line read in the 1 MB chunk might be
                        # incomplete, so we keep it to be processed in the next iteration
                        # TODO: check if this is ok if f.read() stopped in the middle of a \r\n?

t0 = time.time()
i = 0
with bz2.open(r"D:\test.bz2", 'rb') as f:   # raw string: "\t" would otherwise become a tab
    for l in chunked_readlines(f):       # 500k lines per second
    # for l in f:                        # 250k lines per second
        i += 1
        if i % 100000 == 0:
            print('%i lines/sec' % (i / (time.time() - t0)))

It is probably possible to do even better.

We could get a 4x improvement if we could use s as a simple bytes object instead of an io.BytesIO. Unfortunately, splitlines() does not behave as expected in that case: bytes.splitlines() splits on \r and \r\n as well as \n, whereas iterating over a file opened in binary mode splits only on \n, so the two give different results.
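
A sketch of that idea using a plain bytes buffer and split(b'\n') instead of splitlines(), so that only \n is treated as a separator (a variation not present in the original answer; note that, unlike readlines(), it yields lines without the trailing \n):

import bz2

def chunked_lines_bytes(f, chunk_size=1024 * 1024):
    residue = b''
    while True:
        buf = f.read(chunk_size)
        if not buf:
            if residue:
                yield residue   # final line without a trailing \n
            return
        parts = (residue + buf).split(b'\n')
        residue = parts.pop()   # last element is incomplete (or b'' on a line boundary)
        yield from parts

with bz2.open(r"D:\test.bz2", 'rb') as f:   # path taken from the answer above
    n = sum(1 for _ in chunked_lines_bytes(f))
    print(n, 'lines')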



Source: https://stackoverflow.com/questions/65763959/speed-up-reading-in-a-compressed-bz2-file-rb-mode
