Removing duplicate rows from a CSV file using a Python script


离开以前 2020-12-02 10:06

Goal

I have downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies and I don't know why my

6 Answers

独厮守ぢ (OP)

2020-12-02 10:44

You can deduplicate the file efficiently using pandas:

import pandas as pd

file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

# Use sep="," for comma-separated files or sep="\t" for tab-separated ones
df = pd.read_csv(file_name, sep=",")

# Notes:
# - `subset=None` means that every column is used
#   to determine whether two rows are duplicates; to change that,
#   pass a list of column names
# - `inplace=True` means that the DataFrame is modified in place and
#   the duplicate rows are removed from it
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
df.to_csv(file_name_output, index=False)
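
If installing pandas is not an option, the same idea can be sketched with only the standard library. This is a minimal sketch, not part of the answer above: the function name `dedupe_csv` is hypothetical, and it assumes the file is comma-separated, UTF-8 encoded, and that two rows count as duplicates only when every field matches exactly. It keeps the first occurrence of each row and preserves the original order.

```python
import csv


def dedupe_csv(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first copy of each row.

    Returns the number of rows written (the header row, if any, is counted).
    """
    seen = set()  # rows already written, stored as tuples so they are hashable
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            key = tuple(row)
            if key in seen:
                continue  # exact duplicate of an earlier row: skip it
            seen.add(key)
            writer.writerow(row)
            kept += 1
    return kept
```

Calling `dedupe_csv("my_file_with_dupes.csv", "my_file_without_dupes.csv")` would write the deduplicated copy next to the original. Every distinct row is held in memory, so for very large files the pandas approach or a hash of each row may be a better fit.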
