Keep duplicate rows in multiple dataframes

Submitted by 梦想的初衷 on 2020-06-17 09:38:29

Question


Given the following dataframes, how do I extract and keep, in separate dataframes:

  • rows with unique Account only
  • all rows with duplicated Accounts

I have two datasets, df[0]...:

Account     Verified     Paid   Col1 Col2 Col3
1234        True        True     ...  ...  ...
1237        False       True    
1234        True        True
4211        True        True
1237        False       True
312         False       False

...and df[1]:

Account          Verified   Paid   Col1 Col2 Col3
41                True      True    ... ... ...
314               False     False
41                True      True
65                False     False

To pass through all dataframes in my list, without replacing my df[i], and extract the unique rows, I used the following code:

filt = [] 
for i in range(len(df)):  # range(0, 1) would only cover df[0]
    filt.append(df[i].groupby(list(df[i].Account)).agg('first').reset_index())

However, I would also like to pass through all dataframes in my list and, again without replacing my df, extract the rows with duplicated Accounts. In the example above, that should give one dataframe containing accounts 1234 and 1237, and another containing only account 41.

How could I get these two datasets?


Answer 1:


Use drop_duplicates() and duplicated('Account', keep=False) respectively.

You have two dataframes with some duplicates in the 'Account' column. There's no need for the line-by-line groupby hack in your question.

To get a dataframe with unique Accounts only, i.e. with duplicates dropped, use drop_duplicates(). See its keep='first'/'last'/False (i.e. drop all) option, and its inplace=True option.

>>> df[0].drop_duplicates('Account')    
   Account  Verified   Paid Col1 Col2 Col3
0     1234      True   True  ...  ...  ...
1     1237     False   True  NaN  NaN  NaN
3     4211      True   True  NaN  NaN  NaN
5      312     False  False  NaN  NaN  NaN

>>> df[1].drop_duplicates('Account')
   Account  Verified   Paid Col1 Col2 Col3
0       41      True   True  ...  ...  ...
1      314     False  False  NaN  NaN  NaN
3       65     False  False  NaN  NaN  NaN

and to get a dataframe with the duplicated records only, use .duplicated('Account', keep=False), where keep=False means 'mark all duplicates', not just the repeats after the first.

>>> df[0][ df[0].duplicated('Account', keep=False) ]
   Account  Verified  Paid Col1 Col2 Col3
0     1234      True  True  ...  ...  ...
1     1237     False  True  NaN  NaN  NaN
2     1234      True  True  NaN  NaN  NaN
4     1237     False  True  NaN  NaN  NaN
>>> df[1][ df[1].duplicated('Account', keep=False) ]
   Account  Verified  Paid Col1 Col2 Col3
0       41      True  True  ...  ...  ...
2       41      True  True  NaN  NaN  NaN

You might want to sort the last two dataframes by 'Account':

df[0][ df[0].duplicated('Account', keep=False) ].sort_values('Account')

Note: it's not very idiomatic pandas to keep a list df[i] of multiple dataframes and iterate over it. It's generally better to concat them into one dataframe, with an extra column (or index level) recording where each row came from. (It's also more efficient: groupby, apply, drop_duplicates etc. then only need to run once.)
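For illustration, a minimal sketch of that concat approach, assuming a list named dfs and small stand-ins for the example data (the 'source' level name is my choice, not from the question):

```python
import pandas as pd

# Two small frames standing in for df[0] and df[1]
dfs = [
    pd.DataFrame({"Account": [1234, 1237, 1234], "Paid": [True, True, True]}),
    pd.DataFrame({"Account": [41, 314, 41], "Paid": [True, False, True]}),
]

# Stack them; keys= records which original frame each row came from
combined = pd.concat(dfs, keys=range(len(dfs)), names=["source", "row"])
combined = combined.reset_index("source")

# Now one call each suffices for the whole list
unique_rows = combined.drop_duplicates("Account")
dup_rows = combined[combined.duplicated("Account", keep=False)]
```

Note that duplicated() now looks across all frames at once; if an Account should only count as duplicated within its own frame, keep the per-frame approach instead.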




Answer 2:


Let us try reversing each frame before calling duplicated(): with the default keep='first', the reversed order marks every occurrence except the last, so this keeps the first row of each duplicated Account. (The original snippet rebound df inside the loop and filtered on a column 'MAT'; both adjusted here.)

filt = [] 
for i in range(len(df)): 
    d = df[i]  # don't rebind the list df itself
    filt.append(d[d.iloc[::-1].duplicated('Account')])
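For reference, a runnable sketch of this trick on the question's example data (with 'Account' substituted for 'MAT', which looks like a column name from a different dataset):

```python
import pandas as pd

df = [
    pd.DataFrame({"Account": [1234, 1237, 1234, 4211, 1237, 312]}),
    pd.DataFrame({"Account": [41, 314, 41, 65]}),
]

filt = []
for i in range(len(df)):
    d = df[i]
    # Reversed + duplicated(keep='first') marks all but the last
    # occurrence, i.e. selects the first row of each duplicated Account;
    # boolean indexing aligns the mask back to d by index label.
    filt.append(d[d.iloc[::-1].duplicated("Account")])
```

This yields one row per duplicated Account (1234 and 1237 from the first frame, 41 from the second), rather than all duplicated rows as in Answer 1.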


Source: https://stackoverflow.com/questions/61856705/keep-duplicates-rows-in-multiple-dataframes
