Deleting rows based on values in other rows

主宰稳场 submitted on 2019-12-01 08:34:22

For a more general solution, remove the other account_status values from a group only if that group contains at least one active value:

print (df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled <- must be removed
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled

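# True for rows whose status is 'active'
s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
# keep groups with no active row, groups that are all active, and the active rows themselves
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]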
print (df)
  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled
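To see why the mask keeps exactly these rows, it can help to look at its three parts separately. The following is a minimal sketch that rebuilds the frame printed above; the names has_active, all_active and simpler are introduced here for illustration only and are not part of the original answer.

```python
import pandas as pd

# Rebuild the example frame shown above
df = pd.DataFrame({
    'product': ['prod-A'] * 8,
    'product_id': [100, 100, 300, 400, 500, 500, 600, 600],
    'account_status': ['active', 'cancelled', 'active', 'cancelled',
                       'active', 'active', 'cancelled', 'cancelled'],
})

s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product', 'product_id'])['A']

has_active = g.transform('any')   # group contains at least one active row
all_active = g.transform('all')   # every row in the group is active

mask = ~has_active | all_active | s

# all_active can only be True where s is True, so this shorter mask
# selects the same rows on this data
simpler = ~has_active | s

print(pd.DataFrame({'has_active': has_active, 'all_active': all_active,
                    'is_active': s, 'keep': mask}))
result = df[mask]
print(result)
```

Only row 1 has has_active set while not being active itself, so it is the only row dropped.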

This also works well with multiple categories:

print (df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled <- must be removed
2  prod-A         100        pending <- must be removed
3  prod-A         300         active
4  prod-A         300        pending <- must be removed
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled

s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
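Since the same lines are applied to both frames, the logic could be wrapped in a small function. This is only a sketch: the function name drop_others_where_active and its parameters are hypothetical, and it uses the shorter form of the mask (group has no active row, or the row itself is active).

```python
import pandas as pd

def drop_others_where_active(df, keys, status_col='account_status', keep='active'):
    """Hypothetical helper: in groups that contain `keep`, drop every other status."""
    is_keep = df[status_col].eq(keep)
    # True for every row whose group has at least one `keep` row
    group_has_keep = is_keep.groupby([df[k] for k in keys]).transform('any')
    return df[~group_has_keep | is_keep]

# The multi-category frame from the example above
df = pd.DataFrame({
    'product': ['prod-A'] * 10,
    'product_id': [100, 100, 100, 300, 300, 400, 500, 500, 600, 600],
    'account_status': ['active', 'cancelled', 'pending', 'active', 'pending',
                       'cancelled', 'active', 'active', 'pending', 'cancelled'],
})

out = drop_others_where_active(df, ['product', 'product_id'])
print(out)
```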

IMO, groupby is not necessary (I mention this because you tagged your question accordingly). You can use sort_values and drop_duplicates, taking advantage of the fact that "active" < "cancelled" lexicographically. Note that this keeps exactly one row per (product, product_id) pair:

(df.sort_values(['account_status'])
   .drop_duplicates(['product', 'product_id'])
   .sort_index())

  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
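A runnable sketch of this approach follows. The input frame is not shown for this answer, so the four-row frame below is an assumption chosen to reproduce the output above. The second variant, which sorts on an explicit "not active" flag (the column name _not_active is made up here), is one possible way to avoid relying on alphabetical order of the status names.

```python
import pandas as pd

# Assumed input: not shown in the text, chosen to match the output above
df = pd.DataFrame({
    'product': ['prod-A'] * 4,
    'product_id': [100, 100, 300, 400],
    'account_status': ['active', 'cancelled', 'active', 'cancelled'],
})
keys = ['product', 'product_id']

# Variant from the answer: relies on "active" sorting before other statuses
by_name = (df.sort_values('account_status')
             .drop_duplicates(keys)
             .sort_index())

# Variant with an explicit flag: active rows (False) sort first;
# mergesort is stable, so ties keep their original order
by_flag = (df.assign(_not_active=df['account_status'].ne('active'))
             .sort_values('_not_active', kind='mergesort')
             .drop_duplicates(keys)
             .drop(columns='_not_active')
             .sort_index())

print(by_name)
print(by_flag)
```

Both variants keep a single row per group, so they are not a drop-in replacement when a group has several rows that should all survive.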

In the spirit of being consistent with the other answers, you may also want to look at a groupby-based solution involving duplicated and masking.

df
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled
2  prod-A         100        pending
3  prod-A         300         active
4  prod-A         300        pending
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled


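# m1: True for every row in a group that contains at least one 'active' row
m1 = (df.assign(m=df.account_status.eq('active'))
        .groupby(['product', 'product_id'])['m']
        .transform('any'))
# m2: True for every row after the first one in its group
m2 = df.duplicated(['product', 'product_id'])

# drop the later rows of groups that contain an 'active' row
df[~(m1 & m2)]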

  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled

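Like the other solution, this also generalises "nicely" to multiple categories, and removes rows only in groups where "active" is also present. Note that it keeps just the first row of each such group, so it assumes the "active" row comes first within its group, and the duplicate "active" row (index 7) is dropped as well.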
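Because duplicated marks every row after the first one in a group, the mask relies on the "active" row appearing first in its group. If that ordering cannot be guaranteed, one possible variant (a sketch, not part of the original answer; the function name keep_first_active is hypothetical) is to compute the duplicate flags on a copy where active rows have been moved to the front with a stable sort:

```python
import pandas as pd

def keep_first_active(df, keys, status_col='account_status'):
    """Hypothetical variant of the duplicated-based mask that ignores row order."""
    is_active = df[status_col].eq('active')
    # m1: group contains at least one active row
    m1 = is_active.groupby([df[k] for k in keys]).transform('any')
    # Stable sort on "not active" puts active rows first, other rows keep their order
    ordered = df.loc[(~is_active).sort_values(kind='mergesort').index]
    # m2: every row after the first of its group in that order, realigned to df
    m2 = ordered.duplicated(keys).reindex(df.index)
    return df[~(m1 & m2)]

df = pd.DataFrame({
    'product': ['prod-A'] * 10,
    'product_id': [100, 100, 100, 300, 300, 400, 500, 500, 600, 600],
    'account_status': ['active', 'cancelled', 'pending', 'active', 'pending',
                       'cancelled', 'active', 'active', 'pending', 'cancelled'],
})
keys = ['product', 'product_id']

same_order = keep_first_active(df, keys)
# Reversed rows: non-active rows now precede the active ones in groups 100 and 300
reversed_order = keep_first_active(df.iloc[::-1], keys)

print(same_order)
print(reversed_order.sort_index())
```

This sketch assumes the index labels are unique, since it realigns the flags by label.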
