I've a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
You are basically doing a merge sort and removing duplicated entries along the way. Breaking the input into memory-sized pieces, sorting each piece, then merging the pieces while removing duplicates is a sound idea in general.
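If the file really won't fit, here is a minimal sketch of that chunked approach. The function name dedupe_file and the lines_per_chunk default are illustrative assumptions, not part of your original code; tune the chunk size to your memory budget. It writes each sorted chunk to a temporary file, then merges them lazily with heapq.merge:

import heapq
import itertools
import os
import tempfile

def dedupe_file(infilename, outfilename, lines_per_chunk=1_000_000):
    chunk_paths = []
    with open(infilename, 'rb') as src:
        while True:
            # Read one memory-sized chunk of lines and sort it.
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, 'wb') as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # Merge the sorted chunks; equal lines come out adjacent,
    # so groupby collapses each run of duplicates to a single line.
    files = [open(p, 'rb') for p in chunk_paths]
    try:
        with open(outfilename, 'wb') as out:
            for line, _ in itertools.groupby(heapq.merge(*files)):
                out.write(line)
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)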
Actually, for files up to a couple of gigabytes I would let the virtual memory system handle it and just write:
import itertools

# Sort all lines so duplicates become adjacent, then write each
# unique line once (groupby yields one key per run of equal lines).
with open(infilename, 'rb') as input, open(outfile, 'wb') as output:
    for key, group in itertools.groupby(sorted(input)):
        output.write(key)
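This works because sorted() pulls the whole file into (virtual) memory and puts identical rows next to each other, so itertools.groupby emits each distinct line exactly once. Once the file is much larger than RAM, the chunked sketch above is likely the better bet, since it only ever holds one chunk in memory at a time.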