Remove duplicate rows from a large file in Python

Asked by 执笔经年 on 2020-12-15 00:27

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

6 Answers

    Answer from 陌清茗, 2020-12-15 00:53

    Your original solution is slightly incorrect: two different lines can hash to the same value (a hash collision), and your code would silently drop one of them.

    In terms of algorithmic complexity, if you're expecting relatively few duplicates, I think the fastest solution would be to scan the file line by line, adding the hash of each line (as you did), but also storing the location of that line. Then when you encounter a duplicate hash, seek to the original place to make sure that it is a duplicate and not just a hash collision, and if so, seek back and skip the line.

    By the way, if the CSV values are normalized (i.e., records are considered equal iff the corresponding CSV rows are equivalent byte-for-byte), you need not involve CSV parsing here at all, just deal with plain text lines.
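
    A minimal sketch of that idea (untested; the function name, file names, and the use of Python's built-in hash() are just placeholders for illustration). It keeps only a hash and a file offset per distinct line, and re-reads the earlier line through a second file handle whenever a hash repeats, so a collision never causes a row to be lost:

        from collections import defaultdict

        def dedupe_lines(src_path, dst_path):
            """Write the unique lines of src_path to dst_path, keeping first occurrences."""
            # hash(line) -> offsets of the distinct lines already written with that hash.
            # Normally one offset per hash; more than one only on a real collision.
            seen = defaultdict(list)
            with open(src_path, "rb") as scan, \
                 open(src_path, "rb") as verify, \
                 open(dst_path, "wb") as out:
                offset = 0
                for line in scan:
                    is_duplicate = False
                    for prev_offset in seen[hash(line)]:
                        # Re-read the earlier line to confirm a byte-for-byte match,
                        # ruling out a mere hash collision.
                        verify.seek(prev_offset)
                        if verify.readline() == line:
                            is_duplicate = True
                            break
                    if not is_duplicate:
                        seen[hash(line)].append(offset)
                        out.write(line)
                    offset += len(line)

        dedupe_lines("big.csv", "big_deduped.csv")

    Memory use is proportional to the number of distinct rows (one hash and one offset each), not to their total size, which is the whole point of hashing instead of keeping the rows themselves in a set.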
