Question
I am currently merging two dataframes with an inner join, but after merging, every row is duplicated, even though the column I merged on contains the same values. In detail:
import pandas as pd
list_1 = pd.read_csv('list_1.csv')
list_2 = pd.read_csv('list_2.csv')
merged_list = pd.merge(list_1, list_2, on=['email_address'], how='inner')
with the following input and results:
list_1:
email_address, name, surname
john.smith@email.com, john, smith
john.smith@email.com, john, smith
elvis@email.com, elvis, presley
list_2:
email_address, street, city
john.smith@email.com, street1, NY
john.smith@email.com, street1, NY
elvis@email.com, street2, LA
merged_list:
email_address, name, surname, street, city
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
elvis@email.com, elvis, presley, street2, LA
elvis@email.com, elvis, presley, street2, LA
My question is, shouldn't it be like this?
merged_list (how I would like it to be :D):
email_address, name, surname, street, city
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
elvis@email.com, elvis, presley, street2, LA
How can I make it so that it becomes like this? Thanks a lot for your help!
Answer 1:
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1, list_2_nodups, on=['email_address'])
The duplicate rows are expected. Each john smith row in list_1 matches each john smith row in list_2, so two rows on each side produce four rows in the result. To avoid that, I had to drop the duplicates in one of the lists. I chose list_2.
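To see the effect end to end, here is a minimal sketch that rebuilds the question's data inline (rather than reading list_1.csv and list_2.csv) and compares the row counts before and after dropping duplicates. The validate="many_to_one" argument is my own addition, not part of the original answer: it asks pandas to raise an error if the right-hand frame still has repeated keys. The variable name naive is also just illustrative.

```python
import pandas as pd

# Inline stand-ins for the question's CSV files (assumed to match their contents).
list_1 = pd.DataFrame({
    "email_address": ["john.smith@email.com", "john.smith@email.com", "elvis@email.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
list_2 = pd.DataFrame({
    "email_address": ["john.smith@email.com", "john.smith@email.com", "elvis@email.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# Plain merge: each john row on the left pairs with each john row on the right,
# so john contributes 2 x 2 rows and elvis contributes 1 x 1.
naive = pd.merge(list_1, list_2, on=["email_address"], how="inner")

# Drop exact duplicate rows from the right-hand frame, as in the answer.
list_2_nodups = list_2.drop_duplicates()

# validate="many_to_one" (an addition for illustration) checks that the
# merge keys are unique in the right-hand frame and raises otherwise.
merged_list = pd.merge(
    list_1,
    list_2_nodups,
    on=["email_address"],
    how="inner",
    validate="many_to_one",
)

print(len(naive), len(merged_list))
```

Note that drop_duplicates() only removes rows that are identical in every column. If list_2 could hold the same email address with different streets, something like list_2.drop_duplicates(subset=['email_address']) would be needed instead, at the cost of silently picking the first address.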
Source: https://stackoverflow.com/questions/39019591/duplicated-rows-when-merging-dataframes-in-python