python: read lines from compressed text files

长发绾君心 2020-11-27 18:09

Is it easy to read a line from a gz-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it,

4 answers
  • 2020-11-27 18:36

    Using gzip.open:

    import gzip
    
    with gzip.open('input.gz','rt') as f:
        for line in f:
            print('got line', line)
    

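    Note: in binary mode, gzip.open(filename, mode) returns a gzip.GzipFile object; in text mode ('rt', as used here) it additionally wraps that object in an io.TextIOWrapper, so iterating yields str lines. I prefer gzip.open, as it looks similar to the with open(...) as f: used for opening uncompressed files.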
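
    Since the question asks about reading a line without extracting the whole file, here is a minimal sketch of reading only the first few lines. The helper name head_gz and the generated sample file are assumptions made for illustration; gzip.open decompresses incrementally, so only roughly as much of the file as is needed for those lines gets decompressed.

```python
import gzip
import itertools
import os
import tempfile


def head_gz(path, n, encoding="utf-8"):
    """Return the first n lines of a gzip-compressed text file (hypothetical helper).

    The file object is read lazily, so decompression stops shortly after
    the n-th line instead of running over the whole archive.
    """
    with gzip.open(path, "rt", encoding=encoding) as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Build a small sample file so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        f.write(f"line {i}\n")

print(head_gz(sample_path, 3))  # ['line 0', 'line 1', 'line 2']
```

    If n is larger than the number of lines in the file, islice simply stops at the end of the file.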

  • 2020-11-27 18:45

    You could use the standard gzip module in Python. Just use:

    gzip.open('myfile.gz')
    

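    to open the file like any other file and read its lines. With no mode argument the file is opened in binary mode ('rb'), so each line is a bytes object; pass 'rt' to get str lines instead.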

    More information here: Python gzip module
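
    A small sketch of the difference between the default binary mode and text mode, assuming a UTF-8 encoded file. The file name and contents are made up for the example.

```python
import gzip
import os
import tempfile

# Create a tiny compressed file to read back (written as raw bytes).
path = os.path.join(tempfile.mkdtemp(), "myfile.gz")
with gzip.open(path, "wb") as f:
    f.write(b"first\nsecond\n")

# Default mode is 'rb': every line comes back as bytes.
with gzip.open(path) as f:
    binary_lines = list(f)

# Text mode: every line comes back as str, decoded with the given encoding.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text_lines = list(f)

print(binary_lines)  # [b'first\n', b'second\n']
print(text_lines)    # ['first\n', 'second\n']
```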

  • 2020-11-27 18:57

    Have you tried using gzip.GzipFile? Its arguments are similar to those of open(), although it only supports binary modes.
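
    As a rough illustration, GzipFile can be wrapped in io.TextIOWrapper when str lines are wanted. The file name, contents and UTF-8 encoding below are assumptions for the example.

```python
import gzip
import io
import os
import tempfile

# Create a small compressed file to read back.
path = os.path.join(tempfile.mkdtemp(), "input.gz")
with gzip.GzipFile(path, "wb") as gz:
    gz.write("alpha\nbeta\n".encode("utf-8"))

# GzipFile yields bytes, so wrap it in TextIOWrapper to iterate over str lines.
# Closing the wrapper also closes the underlying GzipFile.
with io.TextIOWrapper(gzip.GzipFile(path, "rb"), encoding="utf-8") as text:
    lines = [line.rstrip("\n") for line in text]

print(lines)  # ['alpha', 'beta']
```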

  • 2020-11-27 19:02

    Python's gzip module decompresses inside the Python process, which can be a bit slow on large files. You can speed things up with a system call to pigz, the parallelized implementation of gzip. The downsides are that pigz has to be installed and that it uses more cores during the run, but it is much faster and not more memory-intensive. The call to open the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename, 'rt'). The pigz flags are -d for decompress and -c for writing to stdout, which os.popen then captures. Because the command is built by string concatenation and run through a shell, this form is only suitable for trusted filenames without spaces or shell metacharacters.

    The following code takes a file and a number (1 or 2), counts the number of lines in the file using the corresponding call, and measures how long that takes. Save it as unzip-file.py:

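    #!/usr/bin/python
    import os
    import sys
    import time
    import gzip
    
    def local_unzip(obj):
        """Count the lines of an open file-like object; print elapsed seconds and the count."""
        t0 = time.time()
        count = 0
        with obj as f:
            for line in f:
                count += 1
        print(time.time() - t0, count)
    
    r = sys.argv[1]  # path to the .gz file
    if sys.argv[2] == "1":
        # Method 1: decompress with Python's gzip module
        local_unzip(gzip.open(r,'rt'))
    else:
        # Method 2: let pigz decompress and read its stdout through a pipe
        local_unzip(os.popen('pigz -dc ' + r))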
    

    Running each variant under /usr/bin/time -f %M, which reports the maximum resident set size of the process in kilobytes, on a 28G file gives:

    $ /usr/bin/time -f %M ./unzip-file.py $file 1
    (3037.2604110240936, 1223422024)
    5116
    
    $ /usr/bin/time -f %M ./unzip-file.py $file 2
    (598.771901845932, 1223422024)
    4996
    

    This shows that the pigz call is about five times faster (roughly 10 minutes compared to 50 minutes) while using essentially the same maximum memory. It is also worth noting that, depending on what you do with each line, reading the file might not be the limiting factor, in which case the choice between the two does not matter.
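
    The same idea can be sketched with subprocess instead of os.popen, passing the arguments as a list so the filename never goes through a shell. This is not the code that was benchmarked above; the function name iter_gz_lines, the fallback to the gzip module when pigz is not installed, and the UTF-8 encoding are all assumptions for illustration.

```python
import gzip
import os
import shutil
import subprocess
import tempfile


def iter_gz_lines(path, encoding="utf-8"):
    """Yield str lines from a .gz file, using pigz when it is on PATH (hypothetical helper)."""
    pigz = shutil.which("pigz")
    if pigz is None:
        # pigz is not installed: fall back to Python's own gzip module.
        with gzip.open(path, "rt", encoding=encoding) as f:
            yield from f
        return
    # -d decompresses, -c writes to stdout; the list form avoids shell quoting issues.
    cmd = [pigz, "-dc", path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, encoding=encoding) as proc:
        yield from proc.stdout
    # Reached only when all output was consumed; report a failed decompression.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


# Build a small sample file so the sketch runs on its own.
sample = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample, "wb") as f:
    f.write(b"".join(b"row %d\n" % i for i in range(1000)))

print(sum(1 for _ in iter_gz_lines(sample)))  # 1000
```

    Both branches yield the same lines, so the calling code does not need to know whether pigz was available.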
