Process very large (>20GB) text file line by line

慢半拍i · 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.
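
For concreteness, a minimal sketch of that transformation on a single line (the field values below are made up for illustration):

    line = "70100.045 70900.021 1000.0002 a b c d\n"
    x, y, z = line.split(' ')[:3]
    # Drop the last three characters of each of the first three fields.
    line = line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
    # line is now "70100. 70900. 1000.0 a b c d\n"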

11 Answers
  • 轻奢々 (OP)
    2020-11-29 18:13

    It's more idiomatic to write your code like this:

    def ProcessLargeTextFile():
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                # Split once and keep only the first three fields.
                x, y, z = line.split(' ')[:3]
                # Trim the last three characters from each of those fields.
                w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
    

    The main saving here is that the split is done only once, but if the CPU is not being taxed, this is likely to make very little difference.
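
    The OP's original version isn't shown above, so the three-split variant below is only a guess at it; but if you want to check whether the single split actually matters, a quick timeit comparison along these lines (with a made-up sample line) is an easy test:

    import timeit

    line = "70100.045 70900.021 1000.0002 a b c d\n"

    def split_once():
        x, y, z = line.split(' ')[:3]
        return line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])

    def split_three_times():
        x = line.split(' ')[0]
        y = line.split(' ')[1]
        z = line.split(' ')[2]
        return line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])

    print(timeit.timeit(split_once, number=100000))
    print(timeit.timeit(split_three_times, number=100000))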

    It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!

    def ProcessLargeTextFile():
        bunchsize = 1000000  # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                x, y, z = line.split(' ')[:3]
                bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
                # Flush the accumulated batch to disk in one write call.
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            # Write out whatever remains in the final partial batch.
            w.writelines(bunch)
    

    As suggested by @Janne, here is an alternative way to generate the lines:

    def ProcessLargeTextFile():
        bunchsize = 1000000  # Experiment with different sizes
        bunch = []
        with open("filepath", "r") as r, open("outfilepath", "w") as w:
            for line in r:
                # Split off only the first three fields; rest keeps the
                # remainder of the line, including the trailing newline.
                x, y, z, rest = line.split(' ', 3)
                bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
                if len(bunch) == bunchsize:
                    w.writelines(bunch)
                    bunch = []
            w.writelines(bunch)
    
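    Not part of the answer above, but a related sketch: if the goal is just to cut down the number of physical writes, the batching can also be left to Python's I/O layer by passing a large buffering value to open(). The 4 MiB buffer below is an illustrative guess, not a tuned value:

    def ProcessLargeTextFileBuffered():
        # Same per-line work as above; open() batches the writes internally.
        with open("filepath", "r") as r, \
             open("outfilepath", "w", buffering=4 * 1024 * 1024) as w:
            for line in r:
                x, y, z, rest = line.split(' ', 3)
                w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))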
