Question
I have a DataFrame that is grouped by 'KEY'. I need to compare rows within each group to decide whether to keep every row of the group or just one row.
All rows of a group should be kept when the group contains one row with color 'red', area 12 and shape 'circle' AND another row (in the same group) with color 'green', area 13 and shape 'square'. If a group does not contain both of these rows, I want to keep only the row of that group with the largest 'num' value.
import pandas as pd

df = pd.DataFrame({'KEY': ['100000009', '100000009', '100000009', '100000009', '100000009', '100000034', '100000034', '100000034'],
                   'Date1': [20120506, 20120506, 20120507, 20120608, 20120620, 20120206, 20120306, 20120405],
                   'shape': ['circle', 'square', 'circle', 'circle', 'circle', 'circle', 'circle', 'circle'],
                   'num': [3, 4, 5, 6, 7, 8, 9, 10],
                   'area': [12, 13, 12, 12, 12, 12, 12, 12],
                   'color': ['red', 'green', 'red', 'red', 'red', 'red', 'red', 'red']})
Date1 KEY area color num shape
0 2012-05-06 100000009 12 red 3 circle
1 2012-05-06 100000009 13 green 4 square
2 2012-05-07 100000009 12 red 5 circle
3 2012-06-08 100000009 12 red 6 circle
4 2012-06-20 100000009 12 red 7 circle
5 2012-02-06 100000034 12 red 8 circle
6 2012-03-06 100000034 12 red 9 circle
7 2012-04-05 100000034 12 red 10 circle
Expected result:
Date1 KEY area color num shape
0 2012-05-06 100000009 12 red 3 circle
1 2012-05-06 100000009 13 green 4 square
2 2012-05-07 100000009 12 red 5 circle
3 2012-06-08 100000009 12 red 6 circle
4 2012-06-20 100000009 12 red 7 circle
7 2012-04-05 100000034 12 red 10 circle
I am new to Python, and groupby is throwing me a curveball. This is what I have tried so far:
maxnum = df.groupby('KEY')['num'].transform(max)
df = df.loc[df.num == maxnum]
cond1 = (df[df['area'] == 12]) & (df[df['color'] == 'red']) & (df[df['shape'] == 'circle'])
cond2 = (df[df['area'] == 13]) & (df[df['color'] == 'green']) & (df[df['shape'] == 'square'])
Answer 1:
Define a custom function called function:
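def function(x):
    # Rows of this group matching the first pattern
    i = x.query(
        'area == 12 and color == "red" and shape == "circle"'
    )
    # Rows of this group matching the second pattern
    j = x.query(
        'area == 13 and color == "green" and shape == "square"'
    )
    # Keep the whole group if both patterns occur, else only the max-num row
    return x if not (i.empty or j.empty) else x[x.num == x.num.max()].head(1)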
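This function tests each group against the two conditions and returns the appropriate rows. It queries the group once per condition and checks whether either result is empty using DataFrame.empty. If both kinds of row are present, the whole group is returned; otherwise only the row with the largest num is returned, with head(1) keeping a single row if several rows tie for the maximum.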
Pass this to groupby + apply:
df.groupby('KEY', group_keys=False).apply(function)
Date1 KEY area color num shape
0 20120506 100000009 12 red 3 circle
1 20120506 100000009 13 green 4 square
2 20120507 100000009 12 red 5 circle
3 20120608 100000009 12 red 6 circle
4 20120620 100000009 12 red 7 circle
7 20120405 100000034 12 red 10 circle
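For larger frames, the same rule can probably be expressed without apply. The following is a minimal sketch of one alternative, not part of the original answer: it builds row-level boolean masks, broadcasts "does any row in my group match?" back to every row with transform('any'), and falls back to idxmax for the remaining groups. The variable names (is_red_circle, keep_group, and so on) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'KEY': ['100000009'] * 5 + ['100000034'] * 3,
    'Date1': [20120506, 20120506, 20120507, 20120608,
              20120620, 20120206, 20120306, 20120405],
    'shape': ['circle', 'square', 'circle', 'circle',
              'circle', 'circle', 'circle', 'circle'],
    'num': [3, 4, 5, 6, 7, 8, 9, 10],
    'area': [12, 13, 12, 12, 12, 12, 12, 12],
    'color': ['red', 'green', 'red', 'red', 'red', 'red', 'red', 'red'],
})

# Row-level masks for the two patterns from the question
is_red_circle = (df['area'] == 12) & (df['color'] == 'red') & (df['shape'] == 'circle')
is_green_square = (df['area'] == 13) & (df['color'] == 'green') & (df['shape'] == 'square')

# For every row: does its group contain at least one row of each pattern?
has_red_circle = is_red_circle.groupby(df['KEY']).transform('any')
has_green_square = is_green_square.groupby(df['KEY']).transform('any')
keep_group = has_red_circle & has_green_square

# Index label of the (first) row with the largest num in each group
top_idx = df.groupby('KEY')['num'].idxmax()
is_top = df.index.isin(top_idx)

# Keep whole qualifying groups, plus the max-num row of every other group
result = df[keep_group | is_top]
print(result)
```

Because idxmax returns the first occurrence of the maximum, ties are resolved the same way as head(1) in the function above, and the original row order and index labels are preserved.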
Source: https://stackoverflow.com/questions/48819644/pandas-comparing-rows-within-groups