Duplicated rows when merging dataframes in Python

Posted by 耗尽温柔 on 2020-06-24 08:17:36

Question


I am currently merging two dataframes with an inner join, but after merging I see duplicated rows, even though the column I merged on contains the same values in both dataframes. In detail:

import pandas as pd

list_1 = pd.read_csv('list_1.csv')
list_2 = pd.read_csv('list_2.csv')

merged_list = pd.merge(list_1, list_2, on=['email_address'], how='inner')

with the following input and results:

list_1:

email_address, name, surname
john.smith@email.com, john, smith
john.smith@email.com, john, smith
elvis@email.com, elvis, presley

list_2:

email_address, street, city
john.smith@email.com, street1, NY
john.smith@email.com, street1, NY
elvis@email.com, street2, LA

merged_list:

email_address, name, surname, street, city
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
elvis@email.com, elvis, presley, street2, LA
elvis@email.com, elvis, presley, street2, LA

My question is, shouldn't it be like this?

merged_list (how I would like it to be :D):

email_address, name, surname, street, city
john.smith@email.com, john, smith, street1, NY
john.smith@email.com, john, smith, street1, NY
elvis@email.com, elvis, presley, street2, LA

How can I make it so that it becomes like this? Thanks a lot for your help!


Answer 1:


list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1, list_2_nodups, on=['email_address'])

The duplicate rows are expected. Each john smith row in list_1 matches each john smith row in list_2, so two rows on each side produce 2 × 2 = 4 rows in the result. To avoid this, drop the duplicates from one of the lists before merging; I chose list_2.
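For anyone who wants to try this without the CSV files, here is a minimal sketch that rebuilds the two lists inline from the sample rows in the question. The names merged_raw and guard_raised are just illustrative, and the validate check at the end is an optional extra that is not part of the original answer. It assumes a pandas version that supports the validate argument of pd.merge.

```python
import pandas as pd

# Sample data copied from the question (built inline instead of read from CSV).
list_1 = pd.DataFrame({
    "email_address": ["john.smith@email.com", "john.smith@email.com", "elvis@email.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
list_2 = pd.DataFrame({
    "email_address": ["john.smith@email.com", "john.smith@email.com", "elvis@email.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# Many-to-many merge: each of the 2 john rows in list_1 pairs with
# each of the 2 john rows in list_2, giving 4 john rows.
merged_raw = pd.merge(list_1, list_2, on="email_address", how="inner")

# Drop fully identical rows from one side so each key appears once there.
list_2_nodups = list_2.drop_duplicates()
merged_list = pd.merge(list_1, list_2_nodups, on="email_address", how="inner")

# Optional guard (an addition, not from the answer): validate="many_to_one"
# raises MergeError when the merge keys on the right side are not unique.
try:
    pd.merge(list_1, list_2, on="email_address", validate="many_to_one")
    guard_raised = False
except pd.errors.MergeError:
    guard_raised = True

print(len(merged_raw), len(merged_list), guard_raised)
```

With the inputs exactly as listed, merged_raw has four john rows and one elvis row, while merged_list has the three rows the question asks for. Note that drop_duplicates() with no arguments only removes rows that are identical in every column; if list_2 could hold the same email with different addresses, something like drop_duplicates(subset="email_address") would be needed, and you would have to decide which row to keep.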



Source: https://stackoverflow.com/questions/39019591/duplicated-rows-when-merging-dataframes-in-python
