Removing duplicate rows from a CSV file using a Python script

离开以前 2020-12-02 10:06

Goal

I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my

6 Answers
  •  伪装坚强ぢ
    2020-12-02 10:44

    I know this is long settled, but I had a closely related problem: I needed to remove duplicates based on one column. The input CSV file was too large to open on my PC with MS Excel, LibreOffice Calc, or Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below, which did the job in less than 5 minutes. I didn't focus on optimization, but I believe it can be made faster and more efficient for even bigger files. The algorithm is similar to @IcyFlame's above, except that I remove duplicates based on one column ('CCC') instead of the whole row/line.

    import csv
    
    # open the output with 'w' so a re-run does not append a second header and more rows;
    # newline='' is what the csv module documentation recommends for reader/writer files
    with open('results.csv', 'r', newline='') as infile, \
            open('unique_ccc.csv', 'w', newline='') as outfile:
        # this set will hold the unique ccc numbers seen so far
        ccc_numbers = set()
        # read each input row as a dictionary; there were some null bytes in the infile,
        # so strip them before parsing (older Python versions reject NUL characters in csv)
        results = csv.DictReader(line.replace('\0', '') for line in infile)
        writer = csv.writer(outfile)
    
        # write column headers to output file
        writer.writerow(
            ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
        )
        for result in results:
            ccc_number = result.get('CCC')
            # if the value has already been seen, skip writing the whole row to the output file
            if ccc_number in ccc_numbers:
                continue
            writer.writerow([
                result.get('ID'),
                ccc_number,
                result.get('MFLCode'),
                result.get('DateCollected'),
                result.get('DateTested'),
                result.get('Result'),
                result.get('Justification')
            ])
    
            # remember the value so that later rows with it are skipped
            ccc_numbers.add(ccc_number)
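
    On the optimization point, here is one possible sketch of the same idea as a reusable function. It is not a tuned implementation. The name `dedupe_by_column` and its parameters are made up for illustration, and it assumes the input has a header row and that every row has no more fields than the header. It tracks seen values in a set, so each lookup does not scan everything collected so far, and it reuses the input header instead of hard-coding the column names.

```python
import csv


def dedupe_by_column(in_path, out_path, column):
    """Copy in_path to out_path, keeping only the first row for each value of `column`.

    Hypothetical helper for illustration. Returns the number of data rows written.
    """
    seen = set()  # average O(1) membership test, unlike scanning a list
    written = 0
    # newline='' is recommended by the csv module docs for reader/writer files
    with open(in_path, 'r', newline='') as infile, \
            open(out_path, 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        # reuse the header from the input file (assumes the file is not empty)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            key = row[column]
            if key in seen:
                continue  # a row with this value was already written
            seen.add(key)
            writer.writerow(row)
            written += 1
    return written
```

    With the file names from the script above, the call would be `dedupe_by_column('results.csv', 'unique_ccc.csv', 'CCC')`. Only the distinct column values are held in memory, not the rows themselves, which should keep the footprint modest even for files with millions of records.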
    
