How to remove some rows in a group by in python

送分小仙女□ 提交于 2020-04-18 09:55:07

问题


I'm having a dataframe and I'd like to do a groupby() based a column and then sort the values within each group based on a date column. Then, from a each I'd like to remove records whose value for column_condition == 'B' until I reach to a row whose column_condition == 'A'. For example, Assume the table below is one of the groups

ID, DATE, column_condition
--------------------------
1, jan 2017, B
1, Feb 2017, B
1, Mar 2017, B
1, Aug 2017, A
1, Sept 2017, B

So, I'd like to remove the first three rows and leave this group with only the last two rows. How can I do that?


回答1:


I think I finally understand your question: you wish to groupby a dataframe by 'ID', sort by date, and keep the rows after the first ocurrence of 'A' in your condition column. I've come up with the following one liner solution:

Setting up dummy data

import pandas as pd
import datetime as dt

d = {
    'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
    'DATE': [ # Dates already sorted, but it would work anyways
        dt.date(2018, 7, 19), dt.date(2018, 8, 18),
        dt.date(2018, 9, 17), dt.date(2018, 10, 17),
        dt.date(2018, 11, 16), dt.date(2018, 7, 19),
        dt.date(2018, 8, 18), dt.date(2018, 9, 17),
        dt.date(2018, 10, 17), dt.date(2018, 11, 16)
    ],
    'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']
}
# 'DATE' but with list comprehension: 
# [dt.date.today() + dt.timedelta(days=30*x) for y in range(0, 2) for x in range(0, 5)]
df = pd.DataFrame(d)

Interpreter

>>> (df.sort_values(by='DATE') # we should call pd.to_datetime() first if...
...     .groupby('ID') # 'DATE' is not datetime already
...     .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))

      ID        DATE condition
ID
1  3   1  2018-10-17         A
   4   1  2018-11-16         B
2  8   2  2018-10-17         A
   9   2  2018-11-16         B

You can also call reset_index(drop=True), if you need something like this:

   ID        DATE condition
0   1  2018-10-17         A
1   1  2018-11-16         B
2   2  2018-10-17         A
3   2  2018-11-16         B

(x['condition'].values == 'A') returns a bool np.array, and calling argmax() gives us then index where the first ocurrence of True happens (where condition == 'A' in this case). Using that index, we're subsetting each of the groups with a slice.

EDIT: Added filter for dealing with groups that only contain the undesired condition.

d = {
    'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
    'DATE': [ # Dates already sorted, but it would work anyways
        dt.date(2018, 7, 19), dt.date(2018, 8, 18),
        dt.date(2018, 9, 17), dt.date(2018, 10, 17),
        dt.date(2018, 11, 16), dt.date(2018, 7, 19),
        dt.date(2018, 8, 18), dt.date(2018, 9, 17),
        dt.date(2018, 10, 17), dt.date(2018, 11, 16)
    ], # ID 1 only contains 'B'
    'condition': ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B']
}
df = pd.DataFrame(d)

Interpreter

>>> df
   ID        DATE condition
0   1  2018-07-19         B
1   1  2018-08-18         B
2   1  2018-09-17         B
3   1  2018-10-17         B
4   1  2018-11-16         B
5   2  2018-07-19         B
6   2  2018-08-18         B
7   2  2018-09-17         B
8   2  2018-10-17         A
9   2  2018-11-16         B

>>> (df.sort_values(by='DATE')
...    .groupby('ID')
...    .filter(lambda x: (x['condition'] == 'A').any())
...    .groupby('ID')
...    .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))

     ID        DATE condition
ID
2  8   2  2018-10-17         A
   9   2  2018-11-16         B



回答2:


I'm not sure if there is a cleaner way to do this but I believe the following should work:

for index, row in df.iterrows():
    if row["column_condition"] == "A":
        break
    elif row["column_condition"] == "B":
        df.drop(index, inplace=True)


来源:https://stackoverflow.com/questions/51424613/how-to-remove-some-rows-in-a-group-by-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!