Python - How to gzip a large text file without MemoryError?

温柔的废话 2020-12-16 01:21

I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always get a MemoryError.

3 Answers
  •  春和景丽
    2020-12-16 01:59

    That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.
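
    To illustrate the point about "lines", here is a minimal sketch (Python 3, using in-memory data as a stand-in for real files; the helper name longest_line is made up for this example). Iterating over a file object loads one whole line at a time, so data without newlines arrives as a single huge chunk:

```python
import io

def longest_line(stream):
    """Return the length in bytes of the longest "line" seen while iterating."""
    longest = 0
    for line in stream:  # each iteration holds one whole line in memory
        longest = max(longest, len(line))
    return longest

# Text-like data: many short lines.
csv_like = io.BytesIO(b"a,b,c\n" * 1000)
# Binary-like data: the same number of bytes with no newline at all.
binary_like = io.BytesIO(b"\x00" * 6000)

print(longest_line(csv_like))     # 6
print(longest_line(binary_like))  # 6000
```

    With a multi-gigabyte binary file, that single "line" could exceed the available RAM, whereas a line-structured .csv file should stay close to the first case.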

    But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O, it's generally much faster to read and write larger blocks of data, e.g. 64 kB at a time.

    I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.

    #! /usr/bin/env python
    
    import gzip
    import sys
    
    blocksize = 1 << 16     # 64 kB
    
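    def gzipfile(iname, oname, level):
        # Stream iname into a gzip file at oname one block at a time,
        # so memory use stays near blocksize regardless of file size.
        with open(iname, 'rb') as f_in:
            f_out = gzip.open(oname, 'wb', level)
            try:
                while True:
                    block = f_in.read(blocksize)
                    if block == '':     # empty read means end of file
                        break
                    f_out.write(block)
            finally:
                # Close the output even if reading or writing fails
                f_out.close()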
    
    
    def main():
        if len(sys.argv) < 3:
            print "gzip compress in_file to out_file"
            print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
            sys.exit(1)
    
        iname = sys.argv[1]
        oname = sys.argv[2]
        level = int(sys.argv[3]) if len(sys.argv) > 3 else 6
    
        gzipfile(iname, oname, level)
    
    
    if __name__ == '__main__':  
        main()
    

    I'm running Python 2.6.6, where gzip.open() doesn't support the with statement, hence the explicit close() call.
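
    One possible workaround for the missing with support is contextlib.closing() from the standard library, which wraps any object that has a close() method. The following is only a sketch under that assumption, not the code I actually ran; the function name gzipfile_closing and the BLOCKSIZE constant are made up for this example:

```python
import gzip
from contextlib import closing

BLOCKSIZE = 1 << 16  # 64 kB, the same block size as in the program above

def gzipfile_closing(iname, oname, level=6):
    """Compress iname to oname block by block, closing the output reliably."""
    with open(iname, 'rb') as f_in:
        # closing() calls f_out.close() on exit, even if an exception is raised
        with closing(gzip.open(oname, 'wb', level)) as f_out:
            while True:
                block = f_in.read(BLOCKSIZE)
                if not block:  # empty read means end of file
                    break
                f_out.write(block)
```

    A call such as gzipfile_closing('in_file', 'out_file', 6) would then behave like gzipfile() above, with the output file closed automatically.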


    As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
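
    For Python 3, a minimal sketch of the same block-wise approach might look like the following. It assumes a Python 3 interpreter, where gzip.open() works as a context manager; the name gzipfile_py3 is made up for this example:

```python
import gzip

BLOCKSIZE = 1 << 16  # 64 kB

def gzipfile_py3(iname, oname, level=6):
    """Compress iname to oname, reading and writing one block at a time."""
    with open(iname, 'rb') as f_in, \
            gzip.open(oname, 'wb', compresslevel=level) as f_out:
        while True:
            block = f_in.read(BLOCKSIZE)
            if not block:  # read() returns b'' at end of file
                break
            f_out.write(block)

# Why the original end-of-file test fails on Python 3:
print(b'' == '')  # False: bytes never compare equal to str
print(not b'')    # True: an empty bytes object is falsy
```

    Memory use stays around one block regardless of the input size, since only BLOCKSIZE bytes are held at a time.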
