Remove duplicate rows from a large file in Python

执笔经年 2020-12-15 00:27

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

6 Answers
  •  太阳男子
    2020-12-15 01:15

    Your current method is not guaranteed to work properly.

    Firstly, there is a small probability that two lines which are actually different can produce the same hash value: hash(a) == hash(b) does not always imply that a == b.

    Secondly, you are making the probability higher with your "reduce/lambda" caper:

    >>> from functools import reduce   # reduce is a builtin in Python 2; Python 3 needs this import
    >>> reduce(lambda x,y: x+y, ['foo', '1', '23'])
    'foo123'
    >>> reduce(lambda x,y: x+y, ['foo', '12', '3'])
    'foo123'
    >>>
    

    BTW, wouldn't "".join(['foo', '1', '23']) be somewhat clearer?

    BTW2, why not use a set instead of a dict for htable?
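    For what it's worth, here is a minimal sketch of that idea with a set (the file names and the use of the csv module are my assumptions, and it still inherits the collision caveat above):

    import csv

    def dedupe_csv(in_path, out_path):
        # Keep only a set of hashes in memory, not the rows themselves.
        # NOTE: a hash collision could still drop a genuinely distinct row.
        seen = set()
        with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
            reader = csv.reader(fin)
            writer = csv.writer(fout)
            writer.writerow(next(reader))      # copy the heading row through
            for row in reader:
                key = hash(tuple(row))         # tuple avoids the 'foo'+'1'+'23' ambiguity
                if key not in seen:
                    seen.add(key)
                    writer.writerow(row)

    Storing only the integer hashes keeps memory proportional to the number of distinct rows rather than their total size.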

    Here's a practical solution: get the coreutils package from the GnuWin32 site and install it. Then:

    1. write a copy of your file without the headings to (say) infile.csv
    2. c:\gnuwin32\bin\sort --unique -ooutfile.csv infile.csv
    3. read outfile.csv and write a copy with the headings prepended

    For each of steps 1 & 3, you could use a Python script, or some of the other GnuWin32 utilities (head, tail, tee, cat, ...).
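    If you want to keep it all in one script, here is a minimal sketch of steps 1-3, shelling out to the same sort command for step 2 (the input/output file names are placeholders):

    import subprocess

    # Step 1: copy the data without the heading row
    with open('input.csv') as src, open('infile.csv', 'w') as dst:
        header = src.readline()          # remember the heading for step 3
        for line in src:
            dst.write(line)

    # Step 2: external sort with duplicate removal (GnuWin32 coreutils)
    subprocess.run([r'c:\gnuwin32\bin\sort', '--unique', '-ooutfile.csv', 'infile.csv'],
                   check=True)

    # Step 3: write the final file with the heading prepended
    with open('outfile.csv') as src, open('final.csv', 'w') as dst:
        dst.write(header)
        for line in src:
            dst.write(line)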
