Removing duplicate rows from a csv file using a python script

离开以前 2020-12-02 10:06

Goal

I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my

6 Answers
  • 2020-12-02 10:34

    A more efficient version of @jamylak's solution (with one less instruction):

    with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
        seen = set() # set for fast O(1) amortized lookup
        for line in in_file:
            if line not in seen: 
                seen.add(line)
                out_file.write(line)
    

    To edit the same file in place, you could use this:

    import fileinput
    seen = set() # set for fast O(1) amortized lookup
    for line in fileinput.FileInput('1.csv', inplace=True):
        if line not in seen:
            seen.add(line)
            print(line, end='') # standard output is now redirected to the file
    
  • 2020-12-02 10:40

    You can use the following script:

    Pre-conditions:

    1. 1.csv is the file that contains the duplicates.
    2. 2.csv is the output file that will be devoid of duplicates once this script is executed.

    Code

    
    
    inFile = open('1.csv', 'r')
    outFile = open('2.csv', 'w')
    listLines = []

    for line in inFile:
        # skip the line if it has been seen already
        if line in listLines:
            continue
        else:
            outFile.write(line)
            listLines.append(line)

    outFile.close()
    inFile.close()
    

    Algorithm Explanation

    Here, what I am doing is:

    1. Opening a file in read mode. This is the file that has the duplicates.
    2. Then, in a loop that runs until the file is exhausted, we check whether the line has already been encountered.
    3. If it has been encountered, we don't write it to the output file.
    4. If not, we write it to the output file and add it to the list of lines that have already been encountered.
  • 2020-12-02 10:43

    You can do this with the pandas library in a Jupyter notebook or any relevant IDE. I am importing pandas into the Jupyter notebook and reading the CSV file.

    Then sort the values by the columns on which the duplicates occur; since I have specified two attributes, it will first sort by time and then by latitude.

    Then remove the duplicates present in the time column, or whichever column is relevant for you.

    Then I store the deduplicated, sorted data as gps_sorted.

    import pandas as pd
    stock=pd.read_csv("C:/Users/Donuts/GPS Trajectory/go_track_trackspoints.csv")
    stock2=stock.sort_values(["time","latitude"],ascending=True)
    stock2=stock2.drop_duplicates(subset=['time'])  # assign the result, otherwise the duplicates are not actually dropped
    stock2.to_csv("C:/Users/Donuts/gps_sorted.csv")
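
    By default drop_duplicates keeps the first row in each group of duplicates. As a minimal follow-up to the block above (same hypothetical column name), you can pass keep='last' to retain the most recent row instead:

    # keep the last occurrence of each time value rather than the first
    stock2 = stock2.drop_duplicates(subset=['time'], keep='last')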
    

    Hope this helps

  • 2020-12-02 10:44

    You can achieve deduplication efficiently using pandas:

    import pandas as pd
    file_name = "my_file_with_dupes.csv"
    file_name_output = "my_file_without_dupes.csv"
    
    df = pd.read_csv(file_name, sep=",")  # if the file is tab-separated, use sep="\t" instead
    
    # Notes:
    # - the `subset=None` means that every column is used 
    #    to determine if two rows are different; to change that specify
    #    the columns as an array
    # - the `inplace=True` means that the data structure is changed and
    #   the duplicate rows are gone  
    df.drop_duplicates(subset=None, inplace=True)
    
    # Write the results to a different file
    df.to_csv(file_name_output, index=False)
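
    If you want to check how many rows are duplicated before dropping them, or to deduplicate on specific columns only, here is a small self-contained sketch (the column names are hypothetical, not from the original answer):

    import pandas as pd

    df = pd.read_csv("my_file_with_dupes.csv")

    # how many rows are exact duplicates of an earlier row?
    print(df.duplicated().sum())

    # treat rows as duplicates when only these (hypothetical) columns match
    df = df.drop_duplicates(subset=["email", "date"])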
    
  • 2020-12-02 10:44

    I know this is long settled, but I had a closely related problem in which I had to remove duplicates based on one column. The input CSV file was too large to open on my PC in MS Excel / LibreOffice Calc / Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below to do the job in less than 5 minutes. I didn't focus on optimization, but I believe it can be optimized to run faster and more efficiently for even bigger files (a set-based variant is sketched after the code). The algorithm is similar to @IcyFlame's above, except that I am removing duplicates based on a column ('CCC') instead of the whole row/line.

    import csv
    
    with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
        # this list will hold unique ccc numbers,
        ccc_numbers = []
        # read input file into a dictionary, there were some null bytes in the infile
        results = csv.DictReader(infile)
        writer = csv.writer(outfile)
    
        # write column headers to output file
        writer.writerow(
            ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
        )
        for result in results:
            ccc_number = result.get('CCC')
            # if the value already exists in the list, skip writing its whole row to the output file
            if ccc_number in ccc_numbers:
                continue
            writer.writerow([
                result.get('ID'),
                ccc_number,
                result.get('MFLCode'),
                result.get('datecollected'),
                result.get('DateTested'),
                result.get('Result'),
                result.get('Justification')
            ])
    
            # add the value to the list so it is skipped subsequently
            ccc_numbers.append(ccc_number)
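
    As a minimal sketch of that optimization (same hypothetical file and column names as above), swapping the list for a set makes each membership check O(1) amortized, which matters with 2.5 million rows:

    import csv

    with open('results.csv', 'r', newline='') as infile, open('unique_ccc.csv', 'w', newline='') as outfile:
        seen_ccc = set()  # a set gives O(1) amortized membership tests
        results = csv.DictReader(infile)
        writer = csv.writer(outfile)

        # write column headers to the output file
        writer.writerow(['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification'])

        for result in results:
            ccc_number = result.get('CCC')
            if ccc_number in seen_ccc:
                continue  # this CCC was already written, skip the row
            writer.writerow([result.get('ID'), ccc_number, result.get('MFLCode'),
                             result.get('datecollected'), result.get('DateTested'),
                             result.get('Result'), result.get('Justification')])
            seen_ccc.add(ccc_number)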
    
  • 2020-12-02 10:50

    UPDATE: 2016

    If you are happy to use the helpful more_itertools external library:

    from more_itertools import unique_everseen
    with open('1.csv','r') as f, open('2.csv','w') as out_file:
        out_file.writelines(unique_everseen(f))
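
    unique_everseen also accepts a key callable, so (as a rough sketch, assuming a plain comma-separated file whose first column is the one that should be unique) you can deduplicate on a single column instead of the whole line:

    from more_itertools import unique_everseen

    with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
        # keep only the first row seen for each value of the first column
        out_file.writelines(unique_everseen(f, key=lambda line: line.split(',')[0]))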
    

    A more efficient version of @IcyFlame's solution:

    with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
        seen = set() # set for fast O(1) amortized lookup
        for line in in_file:
            if line in seen: continue # skip duplicate
    
            seen.add(line)
            out_file.write(line)
    

    To edit the same file in place, you could use this:

    import fileinput
    seen = set() # set for fast O(1) amortized lookup
    for line in fileinput.FileInput('1.csv', inplace=True):
        if line in seen: continue # skip duplicate

        seen.add(line)
        print(line, end='') # standard output is now redirected to the file
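
    As an optional tweak (not part of the original answer), fileinput.FileInput also takes a backup suffix, which keeps a copy of the original file around while editing in place:

    import fileinput

    seen = set()
    # backup='.bak' saves the original file as 1.csv.bak before it is rewritten
    for line in fileinput.FileInput('1.csv', inplace=True, backup='.bak'):
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        print(line, end='')  # stdout is redirected into 1.csv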
    