Pandas/Python: How to concatenate two dataframes without duplicates?

执念已碎 2020-11-28 23:33

I'd like to concatenate two dataframes, A and B, into a new one without duplicate rows (rows in B that already exist in A should not be added):

Dataframe A: Dataframe B:

3 answers

眼角桃花 (original poster)

2020-11-29 00:05

I'm surprised that pandas doesn't offer a native solution for this task. I don't think that it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
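
For reference, the drop-duplicates approach mentioned above can be sketched roughly as follows. This is a minimal illustration with made-up frames, not Rian G's exact code. Note that it compares whole rows rather than the index, and it also collapses duplicates that already exist within a.

```python
import pandas as pd

# Hypothetical example frames; the values are illustrative only.
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [2, 5], "y": [4, 6]})

# Stack both frames, then drop every row whose values repeat an earlier row.
# The index is ignored in the comparison, so it is rebuilt afterwards.
combined = pd.concat([a, b]).drop_duplicates().reset_index(drop=True)
print(combined)
```

The whole of b is copied before anything is filtered out, which is the cost the answer is concerned about for large inputs.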

It is probably most efficient to use sets to find the keys that appear in b but not in a, then use a list comprehension to turn those keys into a boolean mask over the rows of b, which iloc[mask, :] accepts for row selection. The function below does this. If you don't pass a column (col, an integer column position) to check for duplicates, the index is used, as you requested. If you do pass a column, be aware that duplicate entries already present in a remain in the result.

import pandas as pd

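def append_non_duplicates(a, b, col=None):
    # Each argument must be a DataFrame or None.
    if ((a is not None and not isinstance(a, pd.DataFrame)) or (b is not None and not isinstance(b, pd.DataFrame))):
        raise ValueError('a and b must be of type pandas.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        # Compare on the values of one column, selected by integer position.
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        # Compare on the index labels.
        aind = a.index.values
        bind = b.index.values
    # Keys present in b but not in a; keeping them in a set makes each lookup O(1).
    new_keys = set(bind) - set(aind)
    # Boolean mask over the rows of b.
    take_rows = [i in new_keys for i in bind]
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    return pd.concat([a, b.iloc[take_rows, :]])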

# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])

append_non_duplicates(a,b)
#        0   1   2
# 1000   1   2   3    <- from a
# 2000   1   5   6    <- from a
# 5000   1  12  13    <- from a
# 3000   7   8   9    <- from b

append_non_duplicates(a,b,0)
#       0   1   2
# 1000  1   2   3    <- from a
# 2000  1   5   6    <- from a
# 5000  1  12  13    <- from a
# 2000  4   5   6    <- from b
# 3000  7   8   9    <- from b
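
As a possible alternative, the same filtering can be written with pandas' own isin instead of Python sets and a list comprehension. The sketch below assumes the same a and b as in the usage example; the name append_non_duplicates_isin is made up for illustration, and the None handling and type checks of the function above are left out.

```python
import pandas as pd

def append_non_duplicates_isin(a, b, col=None):
    """Hypothetical variant: append the rows of b whose key is not already in a."""
    if col is None:
        # Compare on index labels.
        mask = ~b.index.isin(a.index)
    else:
        # Compare on one column, selected by integer position.
        mask = ~b.iloc[:, col].isin(a.iloc[:, col]).to_numpy()
    return pd.concat([a, b.loc[mask]])

a = pd.DataFrame([[1, 2, 3], [1, 5, 6], [1, 12, 13]], index=[1000, 2000, 5000])
b = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[1000, 2000, 3000])

by_index = append_non_duplicates_isin(a, b)   # keeps only row 3000 from b
by_col = append_non_duplicates_isin(a, b, 0)  # keeps rows 2000 and 3000 from b
```

Whether this is faster than the set-based version depends on the data, so it is worth timing both on your own frames.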
