pandas merge df many to many without duplicates

深忆病人 2020-12-07 04:30

Suppose I have two DataFrames like the ones below:

import pandas as pd

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "c": [6,7,8,9]
}
df1 = pd.DataFrame(data_dic)


        
3 Answers
  • 2020-12-07 05:22

    You can remove duplicated rows from both DataFrames before merging:

    df = pd.merge(
        df1.drop_duplicates(), 
        df2.drop_duplicates(), 
        on=['a', 'b'], how='inner'
    )
    print(df)
    
    #    a  b  c   d
    # 0  0  3  6  10
    # 1  0  3  7  10
    # 2  1  4  8  12
    # 3  2  5  9  13
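
A minimal sketch of why this works, assuming df2 is the frame shown in the third answer (the question's own df2 is not shown in full here). A plain merge pairs every duplicated (a, b) row in df1 with every matching row in df2, and dropping identical rows first avoids that. This only helps when the repeated keys in df2 are complete duplicate rows.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
# Assumption: df2 as given in the third answer of this thread
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 10, 12, 13]})

# Plain inner merge: each of the two (0, 3) rows in df1 pairs with each of
# the two (0, 3) rows in df2, so that key alone contributes 2 * 2 = 4 rows.
naive = pd.merge(df1, df2, on=["a", "b"], how="inner")

# Dropping fully identical rows first leaves df2 with one row per key,
# so the merge becomes many-to-one and no rows are multiplied.
deduped = pd.merge(
    df1.drop_duplicates(),
    df2.drop_duplicates(),
    on=["a", "b"],
    how="inner",
)

print(len(naive), len(deduped))  # 6 4
```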
    
  • 2020-12-07 05:27

    Use GroupBy.cumcount to add a counter column to both DataFrames, then merge on the key columns plus that counter:

    df1['g'] = df1.groupby(['a','b']).cumcount()
    df2['g'] = df2.groupby(['a','b']).cumcount()
    
    df = pd.merge(df1, df2, on=['a', 'b', 'g'], how='inner')
    print(df)
       a  b  c  g   d
    0  0  3  6  0  10
    1  0  3  7  1  10
    2  1  4  8  0  12
    3  2  5  9  0  13
    

    The difference from the other solutions is easiest to see when the second 10 in df2 is changed to 11. The merge then matches the first duplicated (a, b) pair in df1 with the first (a, b) pair in df2, the second with the second, and so on for all duplicates; unique pairs are matched as usual:

    data_dic = {
        "a": [0,0,1,2],
        "b": [3,3,4,5],
        "d": [10,11,12,13]
    }
    df2 = pd.DataFrame(data_dic)
    
    
    df1['g'] = df1.groupby(['a','b']).cumcount()
    df2['g'] = df2.groupby(['a','b']).cumcount()
    
    df = pd.merge(df1, df2, on=['a', 'b', 'g'], how='inner')
    print(df)
    
       a  b  c  g   d
    0  0  3  6  0  10
    1  0  3  7  1  11
    2  1  4  8  0  12
    3  2  5  9  0  13
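
The cumcount approach can be wrapped in a small helper so that the input frames are not modified and the counter column does not appear in the result. This is only a sketch: the function name `merge_by_occurrence` and the temporary column name `_g` are hypothetical, and it assumes neither frame already has a column called `_g`.

```python
import pandas as pd

def merge_by_occurrence(left, right, keys, how="inner"):
    """Pair the n-th occurrence of each key in `left` with the n-th in `right`."""
    # assign() returns new frames, so the callers' DataFrames stay untouched
    left_g = left.assign(_g=left.groupby(keys).cumcount())
    right_g = right.assign(_g=right.groupby(keys).cumcount())
    merged = pd.merge(left_g, right_g, on=[*keys, "_g"], how=how)
    # The counter was only needed for matching, so drop it again
    return merged.drop(columns="_g")

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 11, 12, 13]})

result = merge_by_occurrence(df1, df2, ["a", "b"])
print(result)
```

With the data above, the row with c=6 is paired with d=10 and the row with c=7 with d=11, as in the answer's second output.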
    
  • 2020-12-07 05:34

    You could also drop duplicates after the merge:

    data_dic = {
        "a": [0,0,1,2],
        "b": [3,3,4,5],
        "c": [6,7,8,9]
    }
    df1 = pd.DataFrame(data_dic)
    
    data_dic = {
        "a": [0,0,1,2],
        "b": [3,3,4,5],
        "d": [10,10,12,13]
    }
    df2 = pd.DataFrame(data_dic)
    
    
    df3 = pd.merge(df1, df2, how='inner', on=['a', 'b']).drop_duplicates()
    

    df3:

       a  b  c   d
    0  0  3  6  10
    2  0  3  7  10
    4  1  4  8  12
    5  2  5  9  13
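
A short sketch of two details of this approach, under stated assumptions. First, `drop_duplicates(ignore_index=True)` (available in pandas 1.0 and later) renumbers the rows instead of keeping the gaps 0, 2, 4, 5. Second, dropping after the merge only removes rows that are identical in every column, so it does not help if the duplicated keys in df2 carry different values; the changed `d` column below is the one used in the cumcount answer.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 10, 12, 13]})

merged = pd.merge(df1, df2, how="inner", on=["a", "b"])
# ignore_index=True relabels the surviving rows 0..n-1
df3 = merged.drop_duplicates(ignore_index=True)
print(df3)

# Same keys, but the two (0, 3) rows in df2 now differ in column d
df2_changed = df2.assign(d=[10, 11, 12, 13])
still_six = pd.merge(df1, df2_changed, how="inner", on=["a", "b"]).drop_duplicates(
    ignore_index=True
)
# No two merged rows are identical, so nothing is dropped
print(len(still_six))  # 6
```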
    