Remove duplicate rows from a large file in Python

执笔经年 2020-12-15 00:27

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

6 Answers
  •  臣服心动
    2020-12-15 01:06

    You are basically doing a merge sort while removing duplicate entries.

    Breaking the input into memory-sized pieces, sorting each piece, then merging the pieces while removing duplicates is a sound idea in general.
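    For files that genuinely exceed RAM, that chunk-sort-merge approach might look roughly like the sketch below; the function name, chunk_size, and temp-file handling are illustrative choices, not part of this answer, and heapq.merge streams the sorted chunks without reloading them all into memory:

    import heapq
    import itertools
    import tempfile

    def dedupe_external(infilename, outfilename, chunk_size=1_000_000):
        # Phase 1: split the input into individually sorted temp files.
        chunks = []
        with open(infilename, 'rb') as infile:
            while True:
                lines = list(itertools.islice(infile, chunk_size))
                if not lines:
                    break
                tmp = tempfile.TemporaryFile()  # one sorted chunk on disk
                tmp.writelines(sorted(lines))
                tmp.seek(0)
                chunks.append(tmp)
        # Phase 2: stream-merge the sorted chunks; after sorting,
        # duplicates are adjacent, so groupby() yields each line once.
        with open(outfilename, 'wb') as outfile:
            for line, _group in itertools.groupby(heapq.merge(*chunks)):
                outfile.write(line)
        for tmp in chunks:
            tmp.close()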

    That said, for inputs up to a couple of gigabytes I would let the virtual memory system handle it and just write:

    import itertools

    # sorted() loads all lines into memory (letting the OS page to
    # swap if needed); groupby() then collapses runs of identical
    # adjacent lines so each unique line is written once.
    with open(infilename, 'rb') as infile, open(outfilename, 'wb') as outfile:
        for line, _group in itertools.groupby(sorted(infile)):
            outfile.write(line)
    
