Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns

后端 未结 5 1752
离开以前
离开以前 2020-11-30 15:35

I\'m looking for the fastest and idiomatic analog to SQL MINUS (AKA EXCEPT) operator.

Here is what I mean - given two Pandas DataFrames as follows:

I         


        
相关标签:
5条回答
  • 2020-11-30 15:42

    We can use pandas.concat with drop_duplicates here and pass it the argument to drop all duplicates with keep=False:

    pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
    
       a  b  c
    1  0  1  2
    2  1  0  3
    6  2  2  7
    

    Edit after comment by OP

    If you want to make sure that unique rows in df2 arnt taken into account, we can duplicate that df:

    pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
    
       a  b  c
    1  0  1  2
    2  1  0  3
    6  2  2  7
    
    0 讨论(0)
  • 2020-11-30 15:42

    Execution time comparison for larger data sets:

    In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)
    
    In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)
    
    In [102]: df1.shape
    Out[102]: (700000, 3)
    
    In [103]: df2.shape
    Out[103]: (300000, 3)
    

    pd.concat().drop_duplicates() approach:

    In [10]: %%timeit
        ...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
        ...:
        ...:
    2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    multi-index NOT IS IN approach:

    In [11]: %%timeit
        ...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
        ...:
        ...:
    484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    multi-index difference approach:

    In [12]: %%timeit
        ...: tmp1 = df1.reset_index().set_index(["a", "b"])
        ...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
        ...: res = df1.loc[tmp1.loc[idx, "index"]]
        ...:
        ...:
    1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    merge(how="outer") approach - gives me a MemoryError:

    In [106]: %%timeit
         ...: res =  (df1.reset_index()
         ...:         .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
         ...:         .query('_merge == "left_only"')
         ...:         .set_index('index')
         ...:         .rename_axis(None)
         ...:         .reindex(df1.columns, axis=1))
         ...:
         ...:
    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    

    compare concatenated strings approach:

    In [13]: %%timeit
        ...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
        ...:
        ...:
    2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    0 讨论(0)
  • 2020-11-30 15:57

    I had similar question, I tried your idea

    (
    In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
    
    In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
    
    In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
    
    )
    

    for test and it works.

    However, I use the way in my sqlite, tow databases that have the same Structure,that means its tables and tables' columns are the same, and it occurred some mistakes, it shows that this two df seems don't have the same shap.

    if u r happy to give me a hand and want more details, we can have a further conversation thanks a lot

    0 讨论(0)
  • 2020-11-30 16:02

    One possible solution with merge and indicator=True:

    df = (d1.reset_index()
            .merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
            .query('_merge == "left_only"')
            .set_index('index')
            .rename_axis(None)
            .reindex(d1.columns, axis=1))
    print (df)
       a  b  c
    1  0  1  2
    2  1  0  3
    6  2  2  7
    

    Solution with isin:

    df = d1[~d1.set_index(["a", "b"]).index.isin(d2.set_index(["a","b"]).index)]
    print (df)
       a  b  c
    1  0  1  2
    2  1  0  3
    6  2  2  7
    
    0 讨论(0)
  • 2020-11-30 16:04

    I am thinking a little bit like excel here:

    d1[~d1[['a','b']].astype(str).sum(axis=1).isin(d2[['a','b']].astype(str).sum(axis=1))]
    

       a  b  c
    1  0  1  2
    2  1  0  3
    6  2  2  7
    
    0 讨论(0)
提交回复
热议问题