I have a CSV file from which I want to remove duplicate rows, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
Your original solution is slightly incorrect: two different lines can hash to the same value (a hash collision), and your code would incorrectly drop one of them.
In terms of algorithmic complexity, if you're expecting relatively few duplicates, I think the fastest approach would be to scan the file line by line, storing the hash of each line (as you did) along with that line's file offset. Then, when you encounter a hash you've already seen, seek back to the original offset and compare the lines to confirm it's a true duplicate rather than a hash collision; if it is, skip the line.
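For concreteness, here is a minimal sketch of that idea in Python 3 (names like `dedupe_lines`, `seen`, and `checker` are mine, not from your code). It keeps only hashes and byte offsets in memory, and on a hash match it re-reads the earlier line through a second file handle to rule out a collision:

```python
import hashlib

def dedupe_lines(in_path: str, out_path: str) -> None:
    """Copy in_path to out_path, dropping duplicate lines, while keeping
    only hash digests and byte offsets in memory (not the lines themselves)."""
    seen = {}  # hash digest -> list of byte offsets of lines with that digest

    with open(in_path, "rb") as infile, \
         open(in_path, "rb") as checker, \
         open(out_path, "wb") as outfile:
        while True:
            offset = infile.tell()
            line = infile.readline()
            if not line:
                break
            digest = hashlib.sha1(line).digest()
            duplicate = False
            for earlier in seen.get(digest, ()):
                # Same hash: re-read the earlier line to rule out a collision.
                checker.seek(earlier)
                if checker.readline() == line:
                    duplicate = True
                    break
            if not duplicate:
                seen.setdefault(digest, []).append(offset)
                outfile.write(line)
```

With few duplicates and few collisions, the inner comparison almost never runs, so the cost stays close to a single sequential pass plus the hash-table lookups.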
By the way, if the CSV values are normalized (i.e., records are considered equal iff the corresponding CSV rows are identical byte-for-byte), you need not involve CSV parsing here at all; just treat the file as plain text lines.