I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
Your current method is not guaranteed to work properly.
Firstly, there is a small probability that two lines which are actually different will produce the same hash value: hash(a) == hash(b) does not always mean that a == b.
Secondly, you are making the probability higher with your "reduce/lambda" caper:
>>> reduce(lambda x,y: x+y, ['foo', '1', '23'])
'foo123'
>>> reduce(lambda x,y: x+y, ['foo', '12', '3'])
'foo123'
>>>
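(If you do want to hash joined fields, joining them with a separator that can't occur inside a field, e.g. a NUL byte, keeps different rows distinct. This is my suggestion, not something in your code:)
>>> "\0".join(['foo', '1', '23'])
'foo\x001\x0023'
>>> "\0".join(['foo', '12', '3'])
'foo\x0012\x003'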
BTW, wouldn't "".join(['foo', '1', '23']) be somewhat clearer?
BTW2, why not use a set instead of a dict for htable?
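For what it's worth, here is a minimal sketch of the hash-and-set idea with those two points folded in (infile.csv and outfile.csv are placeholder names of mine). Hashing the whole raw line sidesteps the field-joining issue and keeps memory down to one integer per distinct line, but it still carries the small collision risk described above:

seen = set()                      # a set of line hashes, rather than a dict
with open('infile.csv', 'rb') as fin, open('outfile.csv', 'wb') as fout:
    for line in fin:
        h = hash(line)            # hash the raw line; no joining of fields needed
        if h not in seen:
            seen.add(h)
            fout.write(line)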
Here's a practical solution: get the "core utils" package from the GnuWin32 site, and install it. Then:
1. Write a copy of your file without the header line to (say) infile.csv.
2. c:\gnuwin32\bin\sort --unique -ooutfile.csv infile.csv
3. Read outfile.csv and write a copy with the header line re-attached at the top.

For each of steps 1 & 3, you could use a Python script, or some of the other GnuWin32 utilities (head, tail, tee, cat, ...).
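If you go the Python route for steps 1 and 3, a rough sketch might look like the following. The names original.csv and final.csv are placeholders of mine, and I've wrapped step 2 in a subprocess call purely so the whole thing reads as one piece; running sort from the command prompt as above is just as good:

import subprocess

# Step 1: copy the original file minus its header line, so that sort
# doesn't shuffle the header into the middle of the sorted data.
with open('original.csv', 'rb') as fin, open('infile.csv', 'wb') as fout:
    header = fin.readline()
    for line in fin:
        fout.write(line)

# Step 2: let GnuWin32 sort do the heavy lifting; it sorts on disk,
# so the file doesn't need to fit in memory.
subprocess.check_call([r'c:\gnuwin32\bin\sort', '--unique',
                       '-ooutfile.csv', 'infile.csv'])

# Step 3: write the final file with the header re-attached at the top.
with open('outfile.csv', 'rb') as fin, open('final.csv', 'wb') as fout:
    fout.write(header)
    for line in fin:
        fout.write(line)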