I'm looking for the fastest and most idiomatic analog to the SQL MINUS (AKA EXCEPT) operator.
Here is what I mean - given two Pandas DataFrames as follows:
I
We can use pandas.concat with drop_duplicates here, passing keep=False so that every row whose key occurs more than once is dropped:
pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
Edit after comment by OP
If you want to make sure that rows unique to d2 aren't included in the result, we can duplicate that DataFrame so that each of its rows occurs at least twice and is therefore dropped:
pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
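To make the difference between the two concat variants concrete, here is a minimal sketch. The frames below are an assumption: they are illustrative data chosen to be consistent with the output shown above, not necessarily the frames from the question.

```python
import pandas as pd

# Illustrative data (assumed, not taken from the question).
d1 = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1, 2],
    "b": [0, 1, 0, 1, 0, 1, 2],
    "c": [1, 2, 3, 4, 5, 6, 7],
})
d2 = pd.DataFrame({
    "a": [1, 0, 3],
    "b": [1, 0, 3],
    "c": [10, 11, 12],
})
keys = ["a", "b"]

# Plain concat: the (3, 3) row exists only in d2, so it is unique and survives.
plain = pd.concat([d1, d2]).drop_duplicates(keys, keep=False)

# Doubling d2 makes every d2 key occur at least twice, so all d2 rows are dropped.
doubled = pd.concat([d1, pd.concat([d2] * 2)]).drop_duplicates(keys, keep=False)

print(plain)
print(doubled)
```

One caveat worth keeping in mind: keep=False also drops keys that are repeated inside d1 alone, even when they never appear in d2, so this is only equivalent to an anti-join when the keys of d1 that are missing from d2 are unique.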
In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)
In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)
In [102]: df1.shape
Out[102]: (700000, 3)
In [103]: df2.shape
Out[103]: (300000, 3)
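pd.concat().drop_duplicates() approach (note that this cell was run on the small d1 and d2 rather than on df1 and df2, so its timing is not comparable with the ones below):
In [10]: %%timeit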
...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
...:
...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %%timeit
...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
...:
...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: %%timeit
...: tmp1 = df1.reset_index().set_index(["a", "b"])
...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
...: res = df1.loc[tmp1.loc[idx, "index"]]
...:
...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
merge(how="outer") approach - gives me a MemoryError:
In [106]: %%timeit
...: res = (df1.reset_index()
...: .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
...: .query('_merge == "left_only"')
...: .set_index('index')
...: .rename_axis(None)
...: .reindex(df1.columns, axis=1))
...:
...:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
In [13]: %%timeit
...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
...:
...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I had a similar question, so I tried your idea
(
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
)
as a test, and it works.
However, when I applied the same approach to two SQLite databases that have the same structure (their tables and the tables' columns are identical), it raised errors suggesting that the two DataFrames do not have the same shape.
If you are willing to give me a hand and want more details, we can discuss this further. Thanks a lot.
One possible solution with merge and indicator=True:
df = (d1.reset_index()
        .merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
        .query('_merge == "left_only"')
        .set_index('index')
        .rename_axis(None)
        .reindex(d1.columns, axis=1))
print(df)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
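A likely reason this merge becomes expensive on large inputs with heavily repeated keys (such as the MemoryError reported for the benchmark frames) is that merge pairs every matching left row with every matching right row. The sketch below is a variant, not the code above: it merges only the de-duplicated key columns of d2 and uses how="left", so the intermediate result has exactly as many rows as d1. The frames are illustrative data assumed for the example.

```python
import pandas as pd

# Illustrative data (assumed, not taken from the question).
d1 = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1, 2],
    "b": [0, 1, 0, 1, 0, 1, 2],
    "c": [1, 2, 3, 4, 5, 6, 7],
})
d2 = pd.DataFrame({
    "a": [1, 0, 3],
    "b": [1, 0, 3],
    "c": [10, 11, 12],
})
keys = ["a", "b"]

# Only the distinct key combinations of d2 matter for an anti-join.
right_keys = d2[keys].drop_duplicates()

# Each d1 row matches at most one row of right_keys, so no row blow-up occurs.
merged = d1.reset_index().merge(right_keys, on=keys, how="left", indicator=True)

# Keep the rows that found no partner and restore the original index and columns.
result = (merged[merged["_merge"] == "left_only"]
          .set_index("index")
          .rename_axis(None)[d1.columns])
print(result)
```

This assumes d1 has an unnamed index and no column called "index", because reset_index() is relied on to create that column.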
Solution with isin:
df = d1[~d1.set_index(["a", "b"]).index.isin(d2.set_index(["a","b"]).index)]
print(df)
a b c
1 0 1 2
2 1 0 3
6 2 2 7
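The same isin idea can be wrapped in a small helper. This is only a sketch: rows_not_in is a hypothetical name, and it builds the key index with pd.MultiIndex.from_frame instead of set_index, which avoids copying the non-key columns. The frames are illustrative data assumed for the example.

```python
import pandas as pd

def rows_not_in(left, right, keys):
    """Return the rows of `left` whose `keys` combination is absent from `right`."""
    left_keys = pd.MultiIndex.from_frame(left[keys])
    right_keys = pd.MultiIndex.from_frame(right[keys])
    # isin returns a boolean array aligned with the rows of `left`.
    return left[~left_keys.isin(right_keys)]

# Illustrative data (assumed, not taken from the question).
d1 = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1, 2],
    "b": [0, 1, 0, 1, 0, 1, 2],
    "c": [1, 2, 3, 4, 5, 6, 7],
})
d2 = pd.DataFrame({
    "a": [1, 0, 3],
    "b": [1, 0, 3],
    "c": [10, 11, 12],
})

result = rows_not_in(d1, d2, ["a", "b"])
print(result)
```

Unlike the concat/drop_duplicates approach, this keeps every non-matching row of the left frame, including rows whose key is repeated within the left frame itself.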
I am thinking a little bit like Excel here, joining the key columns into a single string per row and comparing those strings:
d1[~d1[['a','b']].astype(str).sum(axis=1).isin(d2[['a','b']].astype(str).sum(axis=1))]
a b c
1 0 1 2
2 1 0 3
6 2 2 7
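One thing to watch with string keys: joining the values without a separator can make different key pairs look identical, for example (1, 11) and (11, 1) both become "111". The sketch below shows the collision and a variant with a separator. joined_key is a hypothetical helper, the data is made up for the illustration, and the separator is assumed never to occur inside the key values.

```python
import pandas as pd

def joined_key(frame, keys, sep=""):
    """Build one string key per row by joining the `keys` columns with `sep`."""
    as_text = frame[keys].astype(str)
    key = as_text[keys[0]]
    for col in keys[1:]:
        key = key + sep + as_text[col]
    return key

# Made-up data in which two different key pairs collide without a separator.
left = pd.DataFrame({"a": [1, 11], "b": [11, 1], "c": [100, 200]})
right = pd.DataFrame({"a": [11], "b": [1]})
keys = ["a", "b"]

# Without a separator both left rows map to "111", so both are removed.
naive = left[~joined_key(left, keys).isin(joined_key(right, keys))]

# With a separator the row (1, 11) is correctly kept.
safe = left[~joined_key(left, keys, sep="|").isin(joined_key(right, keys, sep="|"))]

print(naive)
print(safe)
```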