Python - How to gzip a large text file without MemoryError?

温柔的废话 2020-12-16 01:21

I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always get a MemoryError.
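
The script isn't reproduced here, but judging by the answers it copied the file to gzip line by line, roughly like this (a reconstruction for context, not the poster's exact code; the file names are the ones reused in the answers below):

    import gzip

    # Reconstruction of the kind of line-by-line copy the question describes.
    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            for line in f_in:      # each "line" is loaded fully into memory
                f_out.write(line)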

3 answers
  • 2020-12-16 01:59

    That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.

    But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O, it's generally much faster to read and write larger blocks of data, e.g. 64 kB.

    I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.

    #! /usr/bin/env python
    
    import gzip
    import sys
    
    blocksize = 1 << 16     # 64 kB
    
    def gzipfile(iname, oname, level):
        with open(iname, 'rb') as f_in:
            f_out = gzip.open(oname, 'wb', level)
            while True:
                block = f_in.read(blocksize)
                if block == '':
                    break
                f_out.write(block)
            f_out.close()
        return
    
    
    def main():
        if len(sys.argv) < 3:
            print "gzip compress in_file to out_file"
            print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
            exit(1)
    
        iname = sys.argv[1]
        oname = sys.argv[2]
        level = int(sys.argv[3]) if len(sys.argv) > 3 else 6
    
        gzipfile(iname, oname, level)
    
    
    if __name__ == '__main__':  
        main()
    

    I'm running Python 2.6.6, where gzip.open() doesn't support the with statement, hence the explicit f_out.close().


    As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
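
    Putting those pieces together, a minimal sketch of the same block-copy loop as it might look on a modern Python (where gzip.open() can be used as a context manager and reads return bytes):

    import gzip

    blocksize = 1 << 16  # 64 kB

    def gzipfile(iname, oname, level=6):
        # gzip.open() supports the with statement on any modern Python.
        with open(iname, 'rb') as f_in, gzip.open(oname, 'wb', level) as f_out:
            while True:
                block = f_in.read(blocksize)
                if not block:   # empty bytes object -> end of file
                    break
                f_out.write(block)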

  • 2020-12-16 02:00

    It is odd to get a MemoryError even when reading the file line by line. I suppose it is because you have very little available memory and very large lines. In that case you should read in fixed-size binary chunks:

    import gzip

    # Adjust the chunk size to taste: smaller values take longer, while very
    # large values could themselves cause memory errors on a constrained machine.
    size = 8096

    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            while True:
                data = f_in.read(size)
                if not data:    # empty read means end of file (works for bytes and str)
                    break
                f_out.write(data)
    
  • 2020-12-16 02:08

    The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10GB file with no newlines in it. Quoting the poster's own comment:

    As an additional note, the file I used to test the Python gzip functionality is generated by fallocate -l 10G bigfile_file.

    That gives you a 10GB sparse file made entirely of 0 bytes. Meaning there are no newline bytes. Meaning the first line is 10GB long. Meaning it will take 10GB to read the first line. (Or possibly even 20 or 40GB, if you're using pre-3.3 Python and trying to read it as Unicode.)

    If you want to copy binary data, don't copy line by line. Whether it's a normal file, a GzipFile that's decompressing for you on the fly, a socket.makefile(), or anything else, you will have the same problem.

    The solution is to copy chunk by chunk, or just use shutil.copyfileobj, which does that for you automatically:

    import gzip
    import shutil
    
    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    

    By default, copyfileobj uses a chunk size optimized to be often very good and never very bad. In this case, you might actually want a smaller size, or a larger one; it's hard to predict which a priori.* So, test it with timeit, passing different bufsize arguments (say, powers of 4 from 1 kB to 8 MB) to copyfileobj. But the default 16 kB will probably be good enough unless you're doing a lot of this.

    * If the buffer size is too big, you may end up alternating long chunks of I/O and long chunks of processing. If it's too small, you may end up needing multiple reads to fill a single gzip block.
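
    A rough way to run that comparison is something like the sketch below (the file names are placeholders carried over from the examples above; number=1 because each run is already long):

    import gzip
    import shutil
    import timeit

    def compress(bufsize):
        # 'test_large.csv' / 'test_out.csv.gz' are the placeholder names used above.
        with open('test_large.csv', 'rb') as f_in:
            with gzip.open('test_out.csv.gz', 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out, bufsize)

    # Powers of 4 from 1 kB up to 4 MB; one run each, since each run is slow.
    for bufsize in (1 << 10, 1 << 12, 1 << 14, 1 << 16, 1 << 18, 1 << 20, 1 << 22):
        elapsed = timeit.timeit(lambda: compress(bufsize), number=1)
        print('bufsize %8d: %.2f s' % (bufsize, elapsed))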
