Process very large (>20GB) text file line by line

慢半拍i 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.

11 Answers
  •  孤街浪徒
    2020-11-29 18:10

    ProcessLargeTextFile():
        r = open("filepath", "r")
        w = open("filepath", "w")
        l = r.readline()
        while l:
    

    As has already been suggested, you may want to use a for loop to iterate over the file, which is more efficient.
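
    For example, a minimal sketch of what that could look like (the file paths here are placeholders, and input and output must be different files):

        # Iterating the file object directly reads one line at a time,
        # so the whole 20GB+ file is never held in memory.
        with open("filepath_in", "r") as r, open("filepath_out", "w") as w:
            for l in r:
                w.write(l)  # placeholder: process l before writing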

        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
    

    You are performing the split operation three times here; depending on the length of each line, this will have a detrimental impact on performance. You should split once and assign x, y, and z from the entries of the resulting list.
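
    For example, using the same variable names as above:

        # Split once, then take the first three fields from the resulting list.
        fields = l.split(' ')
        x, y, z = fields[:3]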

        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
    

    For each line you read, you are immediately writing it to the file, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:

    BUFFER_SIZE_LINES = 1024 # Maximum number of lines to buffer in memory

    def ProcessLargeTextFile():
        # Read from one file and write to a different one; opening the
        # input path itself with "w" would truncate it.
        r = open("filepath_in", "r")
        w = open("filepath_out", "w")
        buf = ""
        bufLines = 0
        for lineIn in r:

            x, y, z = lineIn.split(' ')[:3]
            lineOut = lineIn.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])

            # lineIn keeps its trailing newline, so no extra "\n" is needed
            buf += lineOut
            bufLines += 1

            if bufLines >= BUFFER_SIZE_LINES:
                # Flush buffer to disk
                w.write(buf)
                buf = ""
                bufLines = 0

        # Flush remaining buffer to disk
        w.write(buf)
        r.close()
        w.close()

    You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.
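
    As a rough sketch of how you might compare settings, assuming this runs in the same module as the function above (the sizes chosen are arbitrary):

        import time

        # Rebind the buffer-size constant, rerun the function, and time each pass.
        for size in (128, 1024, 8192):
            BUFFER_SIZE_LINES = size
            start = time.perf_counter()
            ProcessLargeTextFile()
            elapsed = time.perf_counter() - start
            print("%d lines buffered: %.1f seconds" % (size, elapsed))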
