Let's say that I have two tables: people_all and people_usa, both with the same structure and therefore the same primary key.
How can I g
Here is another, SQL-like pandas method, .query():
people_all.query('ID not in @people_usa.ID')
Or use NumPy's in1d() function on the ID columns:
people_all[~np.in1d(people_all['ID'], people_usa['ID'])]
NOTE: for those who have experience with SQL, it may be worth reading the pandas "Comparison with SQL" documentation.
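To make the two one-liners above concrete, here is a minimal sketch on made-up sample data (the frame contents are my own assumption, chosen so that IDs 3 and 4 appear in both tables). It uses np.isin rather than in1d, since NumPy's documentation recommends isin for new code; the idea is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: IDs 3 and 4 appear in both frames.
people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})

# query(): the @ prefix refers to the Python variable people_usa.
not_usa_query = people_all.query('ID not in @people_usa.ID')

# NumPy: np.isin builds a boolean membership mask, and ~ negates it.
mask = np.isin(people_all['ID'], people_usa['ID'])
not_usa_numpy = people_all[~mask]

print(not_usa_query['ID'].tolist())
print(not_usa_numpy['ID'].tolist())
```

Both approaches should select the same rows of people_all; which one reads better is mostly a matter of taste.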
Use isin and negate the boolean mask:
people_usa[~people_usa['ID'].isin(people_all['ID'])]
Example:
In [364]:
import numpy as np
import pandas as pd
people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})
people_usa[~people_usa['ID'].isin(people_all['ID'])]
Out[364]:
    ID
2    6
3    7
4  100
So 3 and 4 are removed from the result. The boolean mask looks like this:
In [366]:
people_usa['ID'].isin(people_all['ID'])
Out[366]:
0     True
1     True
2    False
3    False
4    False
Name: ID, dtype: bool
Using ~ inverts the mask.
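The example above filters people_usa against people_all. If what you want is the other direction (rows of people_all that are not in people_usa, which is my reading of the question), the same pattern applies with the frames swapped. A minimal sketch, reusing the same sample data:

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above.
people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})

# Membership mask over people_all: True where the ID also occurs in people_usa.
in_usa = people_all['ID'].isin(people_usa['ID'])

# Negate the mask to keep only the rows of people_all absent from people_usa.
not_in_usa = people_all[~in_usa]

print(not_in_usa['ID'].tolist())
```

The result keeps the original index labels of people_all; call .reset_index(drop=True) on it if you need a fresh index.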
I would combine the data frames by stacking them and then call .drop_duplicates with keep=False, so that every row that appears in both frames is dropped. Documentation found here:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html
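A minimal sketch of the stacking approach, with a couple of assumptions of my own: pd.concat for the stacking step, the sample data from the other answers, and keep=False so that both copies of a shared ID are dropped rather than just the second one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: IDs 3 and 4 appear in both frames.
people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})

# Stack the two frames on top of each other with a fresh index.
stacked = pd.concat([people_all, people_usa], ignore_index=True)

# keep=False drops every row whose ID occurs more than once in the stack.
unique_to_one = stacked.drop_duplicates(subset='ID', keep=False)

print(unique_to_one['ID'].tolist())
```

Note that this keeps rows unique to either frame, so it only equals "people_all minus people_usa" when every row of people_usa also exists in people_all. It would also drop any ID that is duplicated within a single frame.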