Question
With the following dataframes, how do I extract and keep in different dataframes:
- rows with a unique Account only
- all rows with duplicated Accounts
I have two dataframes. df[0]:
Account Verified Paid Col1 Col2 Col3
1234 True True ... ... ...
1237 False True
1234 True True
4211 True True
1237 False True
312 False False
...and df[1]:
Account Verified Paid Col1 Col2 Col3
41 True True ... ... ...
314 False False
41 True True
65 False False
To iterate over all dataframes in my list, without replacing my df[i], and extract the unique rows, I used the following code:
filt = []
for i in range(len(df)):
    filt.append(df[i].groupby('Account').agg('first').reset_index())
However, I would also be interested in iterating over all dataframes in my list and, still without replacing my df, extracting the rows with duplicates.
For example, in the example above, I should have a dataframe that includes accounts 1234 and 1237, and a dataframe that includes only 41.
How could I get these two datasets?
Answer 1:
Use drop_duplicates() and duplicated('Account', keep=False) respectively.
You have two dataframes with some duplicates in the 'Account' column.
There's no need for the line-by-line groupby hack you wrote.
To get a dataframe with unique Accounts only, i.e. with duplicates dropped, use drop_duplicates(). See its keep='first'/'last'/False (i.e. drop all) option and its inplace=True option.
>>> df[0].drop_duplicates('Account')
Account Verified Paid Col1 Col2 Col3
0 1234 True True ... ... ...
1 1237 False True NaN NaN NaN
3 4211 True True NaN NaN NaN
5 312 False False NaN NaN NaN
>>> df[1].drop_duplicates('Account')
Account Verified Paid Col1 Col2 Col3
0 41 True True ... ... ...
1 314 False False NaN NaN NaN
3 65 False False NaN NaN NaN
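If by "unique Account only" you instead mean accounts that appear exactly once (rather than one row per account), pass keep=False so that every member of a duplicated group is dropped. A minimal sketch, rebuilding only the key columns of df[0]:

```python
import pandas as pd

# Minimal reconstruction of df[0] (the Col1/Col2/Col3 columns are omitted for brevity)
d0 = pd.DataFrame({'Account': [1234, 1237, 1234, 4211, 1237, 312],
                   'Verified': [True, False, True, True, False, False],
                   'Paid': [True, True, True, True, True, False]})

# keep=False drops *every* occurrence of a duplicated Account,
# leaving only the accounts that occur exactly once
only_once = d0.drop_duplicates('Account', keep=False)
print(only_once['Account'].tolist())  # [4211, 312]
```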
and to get a dataframe with duplicated records only, use .duplicated('Account', keep=False) which means 'keep all duplicates'.
>>> df[0][ df[0].duplicated('Account', keep=False) ]
Account Verified Paid Col1 Col2 Col3
0 1234 True True ... ... ...
1 1237 False True NaN NaN NaN
2 1234 True True NaN NaN NaN
4 1237 False True NaN NaN NaN
>>> df[1][ df[1].duplicated('Account', keep=False) ]
Account Verified Paid Col1 Col2 Col3
0 41 True True ... ... ...
2 41 True True NaN NaN NaN
You might want to sort the last two dataframes in order of 'Account':
df[0][ df[0].duplicated('Account', keep=False) ].sort_values('Account')
Note: it's not very pandas-idiomatic to have a list df[i] of multiple dataframes and iterate over it. It's generally better to merge or concat the dataframes, with one extra column to distinguish where each row came from. (It's also more efficient: we only need to do groupby, apply, drop_duplicates etc. once.)
Answer 2:
Let us try
filt = []
for i in range(len(df)):
    d = df[i]
    filt.append(d[d.iloc[::-1].duplicated('Account')])
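One caveat worth noting: reversing the frame and calling duplicated() with its default keep='first' is equivalent to duplicated(keep='last') on the original order, so this marks every occurrence except the last, rather than all duplicated rows as keep=False does. A quick check, with df[0] reconstructed minimally:

```python
import pandas as pd

d0 = pd.DataFrame({'Account': [1234, 1237, 1234, 4211, 1237, 312]})

# Reverse-then-duplicated (sorted back to the original row order)
reversed_mask = d0.iloc[::-1].duplicated('Account').sort_index()
# Direct equivalent on the original order
direct_mask = d0.duplicated('Account', keep='last')

print(reversed_mask.equals(direct_mask))          # True
print(d0[direct_mask]['Account'].tolist())        # [1234, 1237]
```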
Source: https://stackoverflow.com/questions/61856705/keep-duplicates-rows-in-multiple-dataframes