I have a number of very large text files which I need to process, the largest being about 60GB.
Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
As has already been suggested, you may want to use a for loop to iterate over the file directly; it is simpler and faster than the readline()/while pattern.
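A minimal sketch of that pattern, reusing the "filepath" placeholder from the question:

r = open("filepath", "r")
for l in r:   # the file object yields one line per iteration
    pass      # process l here; no explicit readline() or loop condition needed
r.close()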
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
You are performing the split operation three times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and unpack x, y and z from the list that comes back.
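For example, with a made-up seven-field line (hypothetical data, purely to illustrate the unpacking):

l = "AAA111 BBB222 CCC333 d e f g\n"   # hypothetical line with seven space-separated fields
x, y, z = l.split(' ')[:3]             # split once, unpack the first three fields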
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
Each line you read is written immediately to the file, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Note also that the output should go to a separate file; opening the file you are reading with "w" would truncate it. Something like this:
BUFFER_SIZE_LINES = 1024 # Maximum number of lines to buffer in memory
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath_out", "w")  # placeholder for a separate output file
    buf = ""
    bufLines = 0
    for lineIn in r:
        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
        buf += lineOut  # lineOut keeps the newline carried over from lineIn
        bufLines += 1
        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0
    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()
You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.
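If you want to measure rather than guess, a rough sketch (assuming you first copy a manageable slice of the real data into a smaller sample file and point the paths inside ProcessLargeTextFile() at it) would be:

import time

# Rough benchmark sketch: rebind the module-level BUFFER_SIZE_LINES and time each run.
# Run this against a small sample file, not the full 60GB input.
for size in (256, 1024, 4096, 16384):
    BUFFER_SIZE_LINES = size
    start = time.time()
    ProcessLargeTextFile()
    print("%6d lines buffered: %.2fs" % (size, time.time() - start))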