I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
How about using the heapq module to read pieces of the file up to your memory limit and write them out as sorted pieces (popping everything off a heap yields the items in sorted order)? Something like the sketch below.
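A minimal sketch of that first pass, assuming an input path; `write_sorted_runs` and the `CHUNK_LINES` limit are hypothetical names of mine, not anything from the question. Each chunk is heapified in memory and popped back out, which writes it to disk as a sorted run file:

    import heapq

    CHUNK_LINES = 100_000  # hypothetical limit; tune to your memory budget

    def write_sorted_runs(path):
        """Split the file at `path` into sorted run files; return their names."""
        run_paths = []
        with open(path) as f:
            while True:
                # Read up to CHUNK_LINES lines into memory.
                chunk = [line for _, line in zip(range(CHUNK_LINES), f)]
                if not chunk:
                    break
                heapq.heapify(chunk)  # O(n) heap build; pops come out sorted
                run_path = f"{path}.run{len(run_paths)}"
                with open(run_path, "w") as out:
                    while chunk:
                        out.write(heapq.heappop(chunk))
                run_paths.append(run_path)
        return run_paths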
Or you could take the first word of each line and split the file into pieces keyed on that. Then, working through the pieces in alphabetical order, read each piece's lines into a set, clearing the set between pieces (a set removes duplicates), to get things half sorted. (Maybe do ' '.join(line.split()) first to normalize the spacing/tabs in each line, if it is OK to change spacing. A set is unordered, so if you want fully sorted output you can push the set's contents into a heap and pop them back out.) Alternatively, you can sort each piece and remove duplicate lines with Joe Koberg's groupby solution. Lastly, join the pieces back together (you can of course write each piece to the final file as you go, while sorting the pieces); see the sketch of that merge-and-dedup step below.
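For the last step, here is a minimal sketch of joining the sorted pieces while dropping duplicates, in the spirit of Joe Koberg's groupby solution; `merge_runs_dedup` is my own hypothetical name, and it assumes the run files were written sorted, as above. heapq.merge streams the files without loading them, and since the merged output is sorted, itertools.groupby makes duplicate lines adjacent so each distinct line is written exactly once:

    import heapq
    from itertools import groupby

    def merge_runs_dedup(run_paths, out_path):
        """Stream sorted runs into one file, keeping one copy of each line."""
        files = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as out:
                merged = heapq.merge(*files)         # lazy k-way merge, stays sorted
                for line, _dups in groupby(merged):  # equal lines are now adjacent
                    out.write(line)                  # write each distinct line once
        finally:
            for f in files:
                f.close()

Together the two passes amount to a standard external merge sort with deduplication, so memory use stays bounded by the chunk size no matter how big the file is.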