Process very large (>20GB) text file line by line

慢半拍i · 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.
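
For concreteness, a minimal sketch of that transformation on a single line (the field values below are made up for illustration):

    line = "70100.045 70900.021 1000.0002 a b c d\n"
    x, y, z = line.split(' ')[:3]
    # Drop the last three characters of each of the first three fields.
    line = line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
    # line is now "70100. 70900. 1000.0 a b c d\n"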

11 Answers
  • 轻奢々 (OP)
    2020-11-29 18:13

    It's more idiomatic to write your code like this:

    def ProcessLargeTextFile():
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                # Split once and keep only the first three fields.
                x, y, z = line.split(' ')[:3]
                # Trim the last three characters from each of those fields.
                w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
    

    The main saving here is that the split is done only once, but if the CPU is not being taxed, this is likely to make very little difference.
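
    The OP's original version isn't shown above, so the three-split variant below is only a guess at it; but if you want to check whether the single split actually matters, a quick timeit comparison along these lines (with a made-up sample line) is an easy test:

    import timeit

    line = "70100.045 70900.021 1000.0002 a b c d\n"

    def split_once():
        x, y, z = line.split(' ')[:3]
        return line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])

    def split_three_times():
        x = line.split(' ')[0]
        y = line.split(' ')[1]
        z = line.split(' ')[2]
        return line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])

    print(timeit.timeit(split_once, number=100000))
    print(timeit.timeit(split_three_times, number=100000))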

    It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!

    def ProcessLargeTextFile():
        bunchsize = 1000000  # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
                # Flush the accumulated batch to disk in one write call.
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            # Write out whatever remains in the final partial batch.
            w.writelines(bunch)
    

    As suggested by @Janne, here is an alternative way to generate the lines:

    def ProcessLargeTextFile():
        bunchsize = 1000000  # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                # Split off only the first three fields; rest keeps the
                # remainder of the line, including the trailing newline.
                x, y, z, rest = line.split(' ', 3)
                bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            w.writelines(bunch)
    
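    Not part of the answer above, but a related sketch: if the goal is just to cut down the number of physical writes, the batching can also be left to Python's I/O layer by passing a large buffering value to open(). The 4 MiB buffer below is an illustrative guess, not a tuned value:

    def ProcessLargeTextFileBuffered():
        # Same per-line work as above; open() batches the writes internally.
        with open("filepath", "r") as r, \
             open("outfilepath", "w", buffering=4 * 1024 * 1024) as w:
            for line in r:
                x, y, z, rest = line.split(' ', 3)
                w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))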
