Question
I am looking for a way to drop rows from my DataFrame based on the values found in other rows.
Here is my dataframe:
product  product_id  account_status
prod-A   100         active
prod-A   100         cancelled
prod-A   300         active
prod-A   400         cancelled
If a row with account_status='active' exists for a given product and product_id combination, keep that row and delete the other rows for that combination.
The desired output is:
product  product_id  account_status
prod-A   100         active
prod-A   300         active
prod-A   400         cancelled
I saw the solution mentioned here but couldn't replicate it for strings.
Any suggestions would be appreciated.
Answer 1:
Here is a more general solution that removes the other account_status values only in groups that contain at least one active value:
print(df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled  <- should be removed
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled
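# True for rows whose status is 'active'
s = df['account_status'].eq('active')
# Group the boolean flags by product and product_id
g = df.assign(A=s).groupby(['product','product_id'])['A']
# Keep a row if its group has no active row, its group is entirely active, or the row itself is active
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]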
print(df)
  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled
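To make the three terms of the mask easier to follow, here is a minimal self-contained sketch that rebuilds the sample frame and names each term separately. The variable names (is_active, group_has_active, group_all_active) are mine, not part of the answer, and the frame is simply the sample data typed in by hand.

```python
import pandas as pd

# Sample data from the answer, rebuilt by hand
df = pd.DataFrame({
    "product": ["prod-A"] * 8,
    "product_id": [100, 100, 300, 400, 500, 500, 600, 600],
    "account_status": ["active", "cancelled", "active", "cancelled",
                       "active", "active", "cancelled", "cancelled"],
})

# Row-level flag: is this row active?
is_active = df["account_status"].eq("active")

# Group-level flags, broadcast back to every row of the group
grouped = df.assign(A=is_active).groupby(["product", "product_id"])["A"]
group_has_active = grouped.transform("any")
group_all_active = grouped.transform("all")

# Same mask as in the answer, with the terms named
mask = ~group_has_active | group_all_active | is_active
result = df[mask]
print(result)

# If a group is entirely active, every row in it is already covered by
# is_active, so the middle term does not change the result
simplified = ~group_has_active | is_active
print((mask == simplified).all())
```

Only row 1 is dropped: it is a non-active row in a group that also has an active row.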
This also works well with more than two status categories:
print(df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled  <- should be removed
2  prod-A         100        pending  <- should be removed
3  prod-A         300         active
4  prod-A         300        pending  <- should be removed
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print(df)
  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
Answer 2:
IMO, groupby is not necessary (I mention it only because you tagged your question with it). You can use sort_values and drop_duplicates, taking advantage of the fact that "active" < "cancelled" lexicographically:
(df.sort_values(['account_status'])
   .drop_duplicates(['product', 'product_id'])
   .sort_index())
  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
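The sort-based approach above depends on "active" sorting before every other status alphabetically. As a hedged sketch of one way around that, the variant below sorts on a boolean key so that active rows come first whatever the other labels are. It assumes pandas 1.1 or later (for the key argument of sort_values), and the "abandoned" status is a made-up label chosen only because it sorts before "active".

```python
import pandas as pd

# Hypothetical data: "abandoned" sorts before "active" alphabetically
df = pd.DataFrame({
    "product": ["prod-A"] * 4,
    "product_id": [100, 100, 300, 400],
    "account_status": ["abandoned", "active", "active", "cancelled"],
})

# Plain lexicographic sort: keeps "abandoned" for product_id 100
naive = (df.sort_values(["account_status"])
           .drop_duplicates(["product", "product_id"])
           .sort_index())

# Sort on "is not active": False (active) sorts before True (anything else).
# mergesort is stable, so rows with equal keys keep their original order.
result = (df.sort_values("account_status",
                         key=lambda s: s.ne("active"),
                         kind="mergesort")
            .drop_duplicates(["product", "product_id"])
            .sort_index())

print(naive)
print(result)
```

Like the original one-liner, this keeps exactly one row per product and product_id combination, so repeated active rows in a group are collapsed to one.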
To be consistent with the other answers, you may also want to look at a groupby-based solution that uses duplicated and masking.
df
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled
2  prod-A         100        pending
3  prod-A         300         active
4  prod-A         300        pending
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
m1 = (df.assign(m=df.account_status.eq('active'))
        .groupby(['product', 'product_id'])['m']
        .transform('any'))
m2 = df.duplicated(['product', 'product_id'])
df[~(m1 & m2)]
  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
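Like the other solution, this also generalises "nicely" to multiple categories, and removes rows with other statuses only in groups where "active" is also present. Two caveats: duplicated marks every row after the first in each group, so this relies on the "active" row coming first within its group (as it does in the sample data), and a group with several "active" rows keeps only the first of them (row 7 is dropped above).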
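If the active row is not guaranteed to come first in its group, the duplicated-based mask can keep the wrong row. The sketch below shows this on a small hand-made frame and then one possible order-independent variant. The data, the variable names and the temporary _active column are all my own for illustration. Like the mask above, the variant keeps a single active row per group.

```python
import pandas as pd

keys = ["product", "product_id"]

# Hypothetical data where the active row is NOT first in group 100
df = pd.DataFrame({
    "product": ["prod-A"] * 6,
    "product_id": [100, 100, 500, 500, 600, 600],
    "account_status": ["cancelled", "active", "active", "active",
                       "pending", "cancelled"],
})

active = df["account_status"].eq("active")
has_active = df.assign(m=active).groupby(keys)["m"].transform("any")

# Mask from the answer: keeps the first row of group 100, which is "cancelled"
order_dependent = df[~(has_active & df.duplicated(keys))]

# Variant: flag the first *active* row of each group, wherever it sits.
# duplicated() on (keys + flag) is False for the first row of each
# (group, active/non-active) pair, so "active & ~dup" picks one active row.
dup = df.assign(_active=active).duplicated(keys + ["_active"])
first_active = active & ~dup
order_independent = df[~has_active | first_active]

print(order_dependent)
print(order_independent)
```

In the first result, group 100 is represented by its cancelled row. In the second, it is represented by the active row at index 1.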
Source: https://stackoverflow.com/questions/53880231/deleting-rows-based-on-values-in-other-rows