In its basic form, my process looks like this:
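import csv
# Python 3: open in text mode with newline='', as the csv module docs recommend
with open('huge_file.csv', newline='') as f:
    reader = csv.reader(f)
    for line in reader:
        process_line(line)  # process_line is the asker's per-row function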
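There isn't a good way to do this for all .csv files, because a quoted field may itself contain a line break, so a newline found in the middle of the file is not guaranteed to be a row boundary. For files without such fields, you should be able to divide the file into chunks using file.seek to skip a section of the file. Then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.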
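import csv
import io
file_one = open('foo.csv', newline='')
file_two = open('foo.csv', 'rb')  # binary mode, so byte offsets are safe to seek to
file_two.seek(0, 2)  # seek to the end of the file
sz = file_two.tell()  # fetch the offset, i.e. the file size in bytes
file_two.seek(sz // 2)  # seek back to the middle (integer division)
ch = None
while ch not in (b'\n', b''):  # stop after a newline, or at end of file
    ch = file_two.read(1)
# file_two is now positioned at the start of a record
segment_one = csv.reader(file_one)
segment_two = csv.reader(io.TextIOWrapper(file_two, newline=''))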
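The same idea extends to more than two chunks. The sketch below is one possible variation, not part of the original answer: `split_offsets` and `read_rows` are hypothetical helper names, and it assumes a UTF-8 file in which no quoted field contains a newline. It works on byte offsets in binary mode and returns each chunk as a (start, end) pair, so a chunk's end offset is known up front.

```python
import csv
import os

def split_offsets(path, parts):
    """Return (start, end) byte ranges, each starting at the beginning of a line.

    Assumption: no quoted field contains an embedded newline.
    """
    size = os.path.getsize(path)
    starts = [0]
    with open(path, 'rb') as f:
        for i in range(1, parts):
            f.seek(size * i // parts)
            f.readline()  # advance to the next line boundary
            pos = f.tell()
            # Drop boundaries that collapse (tiny files or very long rows).
            if starts[-1] < pos < size:
                starts.append(pos)
    ends = starts[1:] + [size]
    return list(zip(starts, ends))

def read_rows(path, start, end, encoding='utf-8'):
    """Yield parsed rows for the lines that begin inside [start, end)."""
    with open(path, 'rb') as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # Parse one physical line at a time; safe only under the
            # no-embedded-newline assumption stated above.
            for row in csv.reader([line.decode(encoding)]):
                yield row
```

Each (start, end) pair could then be handed to a separate worker, for example `for row in read_rows('foo.csv', start, end): process_line(row)`.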
I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.
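A minimal sketch of the row-id check, assuming the id is in the first column and every row has that column. `rows_before` is a hypothetical helper name, and the two in-memory readers stand in for segment_one and segment_two so the example runs on its own.

```python
import csv
import io
from itertools import takewhile

def rows_before(reader, stop_id, id_col=0):
    """Yield rows from reader until a row whose id column equals stop_id."""
    return takewhile(lambda row: row[id_col] != stop_id, reader)

# Stand-ins for the two readers: the first starts at the top of the file,
# the second is already positioned at a record in the middle.
segment_one = csv.reader(io.StringIO("1,a\n2,b\n3,c\n4,d\n"))
segment_two = csv.reader(io.StringIO("3,c\n4,d\n"))

# Reading the first row of segment_two consumes it, so keep it for processing.
first_of_two = next(segment_two)
stop_id = first_of_two[0]

part_one = list(rows_before(segment_one, stop_id))
part_two = [first_of_two] + list(segment_two)
print(part_one)  # [['1', 'a'], ['2', 'b']]
```

Note that this only works if the ids are unique; a repeated id would cut segment_one short.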