Removing duplicate rows from a csv file using a python script

离开以前 2020-12-02 10:06

Goal

I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my

6 Answers
  •  独厮守ぢ
    2020-12-02 10:44

    You can deduplicate the file efficiently using Pandas:

    import pandas as pd
    file_name = "my_file_with_dupes.csv"
    file_name_output = "my_file_without_dupes.csv"
    
    # Use sep="," for a comma-separated file, or sep="\t" if it is tab-separated
    df = pd.read_csv(file_name, sep=",")
    
    # Notes:
    # - subset=None means every column is used to decide whether two rows
    #   are duplicates; pass a list of column names to compare only those columns
    # - inplace=True modifies the DataFrame directly, dropping the duplicate rows
    # - by default the first occurrence of each duplicated row is kept
    df.drop_duplicates(subset=None, inplace=True)
    
    # Write the results to a different file
    df.to_csv(file_name_output, index=False)
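
    If installing Pandas isn't an option, here is a minimal sketch using only the
    standard library's csv module, assuming a comma-separated file that is small
    enough to keep a set of seen rows in memory (the file names are reused from above):

    import csv

    file_name = "my_file_with_dupes.csv"
    file_name_output = "my_file_without_dupes.csv"

    seen = set()
    with open(file_name, newline="") as f_in, \
            open(file_name_output, "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        for row in reader:
            # The whole row is the dedup key (every column matters, like
            # subset=None above); only the first occurrence is written out
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)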
    
