I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
Your current method is not guaranteed to work properly.
Firstly, there is a small probability that two lines which are actually different will produce the same hash value: hash(a) == hash(b) does not always mean that a == b.
Secondly, you are making the probability higher with your "reduce/lambda" caper:
>>> reduce(lambda x,y: x+y, ['foo', '1', '23'])
'foo123'
>>> reduce(lambda x,y: x+y, ['foo', '12', '3'])
'foo123'
>>>
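(If you do want to hash joined fields, joining them with a separator that can't occur inside a field, e.g. a NUL byte, keeps different rows distinct. This is my suggestion, not something in your code:)
>>> "\0".join(['foo', '1', '23'])
'foo\x001\x0023'
>>> "\0".join(['foo', '12', '3'])
'foo\x0012\x003'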
BTW, wouldn't "".join(['foo', '1', '23']) be somewhat clearer?
BTW2, why not use a set instead of a dict for htable?
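For what it's worth, here is a minimal sketch of the hash-and-set idea with those two points folded in (infile.csv and outfile.csv are placeholder names of mine). Hashing the whole raw line sidesteps the field-joining issue and keeps memory down to one integer per distinct line, but it still carries the small collision risk described above:

seen = set()                      # a set of line hashes, rather than a dict
with open('infile.csv', 'rb') as fin, open('outfile.csv', 'wb') as fout:
    for line in fin:
        h = hash(line)            # hash the raw line; no joining of fields needed
        if h not in seen:
            seen.add(h)
            fout.write(line)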
Here's a practical solution: get the "core utils" package from the GnuWin32 site, and install it. Then:
1. Write a copy of your file without the header line to (say) infile.csv.
2. c:\gnuwin32\bin\sort --unique -ooutfile.csv infile.csv
3. Read outfile.csv and write a copy with the header line re-attached at the top.

For each of steps 1 & 3, you could use a Python script, or some of the other GnuWin32 utilities (head, tail, tee, cat, ...).
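If you go the Python route for steps 1 and 3, a rough sketch might look like the following. The names original.csv and final.csv are placeholders of mine, and I've wrapped step 2 in a subprocess call purely so the whole thing reads as one piece; running sort from the command prompt as above is just as good:

import subprocess

# Step 1: copy the original file minus its header line, so that sort
# doesn't shuffle the header into the middle of the sorted data.
with open('original.csv', 'rb') as fin, open('infile.csv', 'wb') as fout:
    header = fin.readline()
    for line in fin:
        fout.write(line)

# Step 2: let GnuWin32 sort do the heavy lifting; it sorts on disk,
# so the file doesn't need to fit in memory.
subprocess.check_call([r'c:\gnuwin32\bin\sort', '--unique',
                       '-ooutfile.csv', 'infile.csv'])

# Step 3: write the final file with the header re-attached at the top.
with open('outfile.csv', 'rb') as fin, open('final.csv', 'wb') as fout:
    fout.write(header)
    for line in fin:
        fout.write(line)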