Deleting rows based on values in other rows

梦想的初衷 submitted on 2019-12-01 07:21:09

Question


I am looking for a way to drop rows from my dataframe based on conditions checked against values in other rows.

Here is my dataframe:

product product_id  account_status
prod-A  100         active
prod-A  100         cancelled
prod-A  300         active
prod-A  400         cancelled

If a row with account_status='active' exists for a product and product_id combination, then retain this row and delete the other rows for that combination.

The desired output is:

product product_id  account_status
prod-A  100         active
prod-A  300         active
prod-A  400         cancelled

I saw the solution mentioned here but couldn't replicate it for strings.

Please suggest.


Answer 1:


For a more general solution, remove the other account_status values within a group only if that group contains at least one active value:

print (df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled <- should be removed
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled

s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled
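For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same mask. The sample frame is rebuilt by hand from the printed output above, and the helper name keep_active_where_present and its parameters are my own, not part of pandas or of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the printed frame above.
df = pd.DataFrame({
    "product": ["prod-A"] * 8,
    "product_id": [100, 100, 300, 400, 500, 500, 600, 600],
    "account_status": ["active", "cancelled", "active", "cancelled",
                       "active", "active", "cancelled", "cancelled"],
})


def keep_active_where_present(frame, keys=("product", "product_id"),
                              status_col="account_status", wanted="active"):
    """Hypothetical helper: drop non-`wanted` rows only in groups that contain `wanted`."""
    is_wanted = frame[status_col].eq(wanted)
    # Temporary boolean column so the flag can be aggregated per group.
    grouped = frame.assign(_flag=is_wanted).groupby(list(keys))["_flag"]
    # Keep a row if its group has no wanted row, or only wanted rows,
    # or the row itself is wanted.
    mask = ~grouped.transform("any") | grouped.transform("all") | is_wanted
    return frame[mask]


result = keep_active_where_present(df)
print(result)
```

Only row 1 (the cancelled row in the group that also has an active row) is dropped; groups without any active row are left untouched.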

This also works well with multiple categories:

print (df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled <- should be removed
2  prod-A         100        pending <- should be removed
3  prod-A         300         active
4  prod-A         300        pending <- should be removed
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled

s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
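A side note on the mask itself: as far as I can tell, the g.transform('all') term is already implied by s, because every row in an all-active group is itself active. The sketch below checks this on the sample data; the variable names are mine, and the two masks are only compared on this one frame.

```python
import pandas as pd

# Sample data rebuilt from the multi-category frame above.
df = pd.DataFrame({
    "product": ["prod-A"] * 10,
    "product_id": [100, 100, 100, 300, 300, 400, 500, 500, 600, 600],
    "account_status": ["active", "cancelled", "pending", "active", "pending",
                       "cancelled", "active", "active", "pending", "cancelled"],
})

is_active = df["account_status"].eq("active")
by_group = is_active.groupby([df["product"], df["product_id"]])
group_has_active = by_group.transform("any")
group_all_active = by_group.transform("all")

# Mask as written in the answer, and the same mask without the 'all' term.
full_mask = ~group_has_active | group_all_active | is_active
short_mask = ~group_has_active | is_active

print((full_mask == short_mask).all())
print(df[short_mask])
```

Read this way, the rule is "keep a row if it is active, or if its group has no active row at all".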



Answer 2:


IMO, groupby is not necessary here (I mention it because you tagged your question with it). You can use sort_values and drop_duplicates instead, taking advantage of the fact that "active" < "cancelled" lexicographically:

(df.sort_values(['account_status'])
   .drop_duplicates(['product', 'product_id'])
   .sort_index())

  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
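The lexicographic trick stops working as soon as some status sorts before "active". Below is a sketch of one way around that, sorting on an explicit "is not active" key instead of the raw strings. It assumes pandas 1.1 or later for the key argument of sort_values, and the "abandoned" status is a made-up value used only to show the problem case.

```python
import pandas as pd

# "abandoned" is a hypothetical status that sorts before "active".
df = pd.DataFrame({
    "product": ["prod-A"] * 4,
    "product_id": [100, 100, 300, 400],
    "account_status": ["abandoned", "active", "active", "cancelled"],
})

result = (
    df.sort_values(
        "account_status",
        key=lambda s: s.ne("active"),  # False (active) sorts before True
        kind="mergesort",              # stable, so ties keep their original order
    )
    .drop_duplicates(["product", "product_id"])  # keeps the first row per group
    .sort_index()
)
print(result)
```

Like the original one-liner, this keeps exactly one row per product/product_id pair, so a group with several non-active rows is also reduced to a single row.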

In the spirit of being consistent with the other answers, you may also want to take a look at a groupby-based solution involving duplicated and masking.

df
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled
2  prod-A         100        pending
3  prod-A         300         active
4  prod-A         300        pending
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled


m1 = (df.assign(m=df.account_status.eq('active'))
        .groupby(['product', 'product_id'])['m']
        .transform('any'))
m2 = df.duplicated(['product', 'product_id'])

df[~(m1 & m2)]

  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled

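Like the other solution, this also generalises "nicely" to multiple categories, and removes rows only in groups where "active" is also present. Note that duplicated keeps the first row of each group, so this assumes the "active" row comes first within its group; it also drops repeated "active" rows (row 7 above).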
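Because df.duplicated marks every row after the first in a group, the m1 & m2 mask depends on row order. The sketch below uses a small made-up frame in which "cancelled" is listed before "active" to show the difference, next to an order-independent variant that masks on the status itself. The frame and the names original and robust are illustrative only.

```python
import pandas as pd

# Made-up data: in group 100 the "cancelled" row comes before the "active" one.
df = pd.DataFrame({
    "product": ["prod-A"] * 5,
    "product_id": [100, 100, 300, 400, 400],
    "account_status": ["cancelled", "active", "active", "pending", "cancelled"],
})

keys = ["product", "product_id"]
is_active = df["account_status"].eq("active")
# True for every row whose group contains at least one "active" row.
m1 = is_active.groupby([df[k] for k in keys]).transform("any")

# Mask from the answer: keeps the first row of each group that has an active row.
m2 = df.duplicated(keys)
original = df[~(m1 & m2)]

# Order-independent variant: in those groups, drop the rows that are not active.
robust = df[~(m1 & ~is_active)]

print(original)
print(robust)
```

On this frame the duplicated-based mask keeps the cancelled row 0 and drops the active row 1, while the variant keeps row 1. Unlike the duplicated-based mask, the variant also keeps repeated "active" rows.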



Source: https://stackoverflow.com/questions/53880231/deleting-rows-based-on-values-in-other-rows
