Remove duplicate rows from a large file in Python

执笔经年 2020-12-15 00:27

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

6 Answers
  •  臣服心动
    2020-12-15 01:06

    You are basically doing a merge sort while removing duplicate entries.

    Breaking the input into memory-sized pieces, sorting each piece, then merging the pieces while removing duplicates is a sound idea in general.
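    For files that genuinely exceed RAM, that chunk-sort-merge approach might look roughly like the sketch below; the function name, chunk_size, and temp-file handling are illustrative choices, not part of this answer, and heapq.merge streams the sorted chunks without reloading them all into memory:

    import heapq
    import itertools
    import tempfile

    def dedupe_external(infilename, outfilename, chunk_size=1_000_000):
        # Phase 1: split the input into individually sorted temp files.
        chunks = []
        with open(infilename, 'rb') as infile:
            while True:
                lines = list(itertools.islice(infile, chunk_size))
                if not lines:
                    break
                tmp = tempfile.TemporaryFile()  # one sorted chunk on disk
                tmp.writelines(sorted(lines))
                tmp.seek(0)
                chunks.append(tmp)
        # Phase 2: stream-merge the sorted chunks; after sorting,
        # duplicates are adjacent, so groupby() yields each line once.
        with open(outfilename, 'wb') as outfile:
            for line, _group in itertools.groupby(heapq.merge(*chunks)):
                outfile.write(line)
        for tmp in chunks:
            tmp.close()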

    That said, for inputs up to a couple of gigabytes I would let the virtual memory system handle it and just write:

    import itertools

    # sorted() loads all lines into memory (letting the OS page to
    # swap if needed); groupby() then collapses runs of identical
    # adjacent lines so each unique line is written once.
    with open(infilename, 'rb') as infile, open(outfilename, 'wb') as outfile:
        for line, _group in itertools.groupby(sorted(infile)):
            outfile.write(line)
    
