pandas-groupby

Pandas: how to groupby based on series pattern

爱⌒轻易说出口 submitted on 2019-12-11 02:32:12
Question: Given the following df:

df = pd.DataFrame({'bool': [True, True, True, False, True, True, True],
                   'foo': [1, 3, 2, 6, 2, 4, 7]})

which results in:

    bool  foo
0   True    1
1   True    3
2   True    2
3  False    6
4   True    2
5   True    4
6   True    7

how can I group the True rows into two groups, so that indexes [0:2] form group 1 and [4:6] form group 2? The desired output:

group1:
    bool  foo
0   True    1
1   True    3
2   True    2

group2:
    bool  foo
4   True    2
5   True    4
6   True    7

Thank you!

Answer 1: You could do:

import numpy as np
x = df[df["bool"]].index.values
groups = np.split(x, np.where(np.diff(x) != 1)[0] + 1)
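The answer is cut off in the source after np.split(x; the split condition shown above (break wherever consecutive True indexes stop being adjacent) is the standard idiom for this and reproduces the desired output. A runnable version:

import numpy as np
import pandas as pd

df = pd.DataFrame({'bool': [True, True, True, False, True, True, True],
                   'foo': [1, 3, 2, 6, 2, 4, 7]})

# Index values of the True rows: [0, 1, 2, 4, 5, 6]
x = df[df['bool']].index.values

# Split wherever two True indexes are not consecutive
groups = np.split(x, np.where(np.diff(x) != 1)[0] + 1)

for i, g in enumerate(groups, start=1):
    print(f'group{i}:')
    print(df.loc[g])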

How to calculate sum|mean|median for the tail of each group when aggregating pandas data in Python

强颜欢笑 submitted on 2019-12-11 02:25:40
Question: I have data like the following, in pandas DataFrame format:

A  B  C  D  E  F  G
1  1  2  3  1  4  2
1  1  2  4  5  6  7
1  1  2  3  2  3  2
1  1  2  4  5  6  7
2  1  2  3  2  3  4
2  1  2  3  4  3  3
2  1  2  4  5  6  7

Here agg_lvl = ['A', 'B', 'C']. I want to calculate mean|median|sum of the G variable using only the tail(2) records of each group when the data is aggregated to agg_lvl. My expected output for the mean is:

A  B  C    G
1  1  2  4.5
2  1  2    5

The output has the same shape for median and sum; only the aggregation function changes.
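A minimal sketch of one way to do this: keep the tail(2) rows of each group, then aggregate the survivors with a second groupby. Columns D, E, F are omitted below since they do not affect the result:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2],
                   'B': [1] * 7,
                   'C': [2] * 7,
                   'G': [2, 7, 2, 7, 4, 3, 7]})
agg_lvl = ['A', 'B', 'C']

# Last two rows per (A, B, C) group, then aggregate G;
# swap .mean() for .median() or .sum() as needed
out = (df.groupby(agg_lvl)
         .tail(2)
         .groupby(agg_lvl)['G']
         .mean()
         .reset_index())
print(out)   # (1, 1, 2) -> 4.5 and (2, 1, 2) -> 5.0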

Determine change in values in a grouped dataframe

不打扰是莪最后的温柔 submitted on 2019-12-11 02:23:41
Question: Assume a dataset like this (originally read in from a .csv):

data = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                     'time': ['2017-01-01 12:00:00', '2017-01-01 12:00:00', '2017-01-01 12:00:00',
                              '2017-01-01 12:10:00', '2017-01-01 12:10:00', '2017-01-01 12:10:00'],
                     'value': [10, 11, 12, 10, 12, 13]})

=>

   id                 time  value
0   1  2017-01-01 12:00:00     10
1   2  2017-01-01 12:00:00     11
2   3  2017-01-01 12:00:00     12
3   1  2017-01-01 12:10:00     10
4   2  2017-01-01 12:10:00     12
5   3  2017-01-01 12:10:00     13

Time is identical for all IDs at each measurement point.
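The rest of the question is cut off in the source, but given the title, a common version of this task is computing, per id, how value changed between consecutive timestamps. A sketch assuming that goal, using groupby().diff():

import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                     'time': pd.to_datetime(['2017-01-01 12:00:00'] * 3 +
                                            ['2017-01-01 12:10:00'] * 3),
                     'value': [10, 11, 12, 10, 12, 13]})

# Sort so diff() within each id compares consecutive timestamps
data = data.sort_values(['id', 'time'])
data['change'] = data.groupby('id')['value'].diff()
print(data)   # change is 0 for id 1, 1 for ids 2 and 3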

numpy unique could not filter out groups with the same value on a specific column

自闭症网瘾萝莉.ら submitted on 2019-12-11 01:50:38
Question: I tried to group a df and then select the groups whose size is > 1 and whose rows do not all share the same value in a specific column:

df.groupby(['account_no', 'ext_id', 'amount']).filter(
    lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))

The df looks like this; note that some account_no strings are just a single space, ext_id and int_id are also strings, and amount is a float:

account_no   ext_id      amount   int_id
            2665057  439.504062  D000192
            2665057  439.504062  D000192
             353724     2758.92      952
             353724     2758.92      952
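A hedged sketch of a more robust filter using pandas' own nunique after str.strip(), in case stray whitespace makes identical-looking strings compare unequal to np.unique. The frame below is a hypothetical reconstruction of the one shown, with a trailing space planted in one int_id to demonstrate:

import pandas as pd

df = pd.DataFrame({'account_no': [' '] * 4,
                   'ext_id': ['2665057', '2665057', '353724', '353724'],
                   'amount': [439.504062, 439.504062, 2758.92, 2758.92],
                   'int_id': ['D000192', 'D000192 ', '952', '952']})

out = (df.groupby(['account_no', 'ext_id', 'amount'])
         .filter(lambda x: len(x) > 1
                 and x['int_id'].str.strip().nunique() != 1))
print(out)   # empty: after stripping, every group has a single int_id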

What is the pythonic way of collapsing values into a set for multiple columns for each group in pandas dataframes?

亡梦爱人 submitted on 2019-12-11 01:49:16
Question: Given a dataframe, collapsing values into a set per group for a single column is straightforward:

df.groupby('A')['B'].apply(set)

But what is the pythonic way to do it for multiple columns, with the result in a dataframe? For example, for the following dataframe:

import pandas as pd
df = pd.DataFrame({'user_id': [1, 2, 3, 4, 1, 2, 3],
                   'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                                  'Ju-jitsu', 'Krav Maga', 'Karate'],
                   'instructor': ['Bob', 'Alice', 'Bob', 'Alice', …
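The instructor list is truncated in the source, so the last three names in the sketch below are placeholders. The usual answer is to pass set to agg, which collapses every non-grouping column at once:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3, 4, 1, 2, 3],
                   'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                                  'Ju-jitsu', 'Krav Maga', 'Karate'],
                   # the last three instructors are made-up placeholders
                   'instructor': ['Bob', 'Alice', 'Bob', 'Alice',
                                  'Alice', 'Bob', 'Carol']})

# agg(set) applies the same collapsing to each remaining column
out = df.groupby('user_id').agg(set)
print(out)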

Slicing a DataFrameGroupBy object

别来无恙 submitted on 2019-12-11 01:16:26
Question: Is there a way to slice a DataFrameGroupBy object? For example, if I have:

df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})

   A  B
0  2  x
1  1  y
2  1  z
3  3  r
4  3  p

dfg = df.groupby('A')

The returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could look like:

dfg.loc[1:2].agg(...)

or, for a specific column:

dfg['B'].loc[1:2].agg(...)

EDIT. To make it clearer: by slicing the GroupBy object I mean selecting a subset of the groups by their key values.
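A GroupBy object has no .loc, so the usual workarounds are to filter the rows before grouping, or to pull individual groups out with get_group. A sketch:

import pandas as pd

df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})

# Restrict to the wanted key range first, then group and aggregate
print(df[df['A'].between(1, 2)].groupby('A')['B'].agg(','.join))

# Or fetch a single group from an existing GroupBy
dfg = df.groupby('A')
print(dfg.get_group(1))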

Printing the top 2 most frequently occurring values of the target column

一个人想着一个人 submitted on 2019-12-10 22:15:38
Question: I have the three columns shown below and am trying to return the top-1 and top-2 highest counts from the third column, in the format shown under the expected output.

Data:

print(df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

Code:

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a', 'b', 'c'))
       .reset_index(level=2)['c'])

Output:

print(s)
a   b
4   F    PG13
10  M       R
    M      PG
Name: c, dtype: object

Expected output:
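The expected output is cut off in the source; assuming it is one row per (AGE, GENDER) with top1/top2 columns, a hedged sketch (the padding with None handles groups that have only one distinct rating):

import pandas as pd

df = pd.DataFrame({'AGE': [10, 10, 10, 4, 4],
                   'GENDER': ['M', 'M', 'M', 'F', 'F'],
                   'rating': ['PG', 'R', 'R', 'PG13', 'PG13']})

# Two most frequent ratings per group, padded to length 2 with None
s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: (x.value_counts().index[:2].tolist() + [None])[:2]))
out = pd.DataFrame(s.tolist(), index=s.index,
                   columns=['top1', 'top2']).reset_index()
print(out)   # 4/F: PG13, None -- 10/M: R, PG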

pandas groupby mean with nan

不羁岁月 submitted on 2019-12-10 22:08:30
Question: I have the following dataframe:

date  id  cars
2012   1     4
2013   1     6
2014   1   NaN
2012   2    10
2013   2    20
2014   2   NaN

Now I want the mean of cars over the years for each id, ignoring the NaNs. The result should look like this:

date  id  cars  result
2012   1     4       5
2013   1     6       5
2014   1   NaN       5
2012   2    10      15
2013   2    20      15
2014   2   NaN      15

I used the following command:

df["result"] = df.groupby("id")["cars"].mean()

It runs without errors, but the result column contains only NaNs. What did I do wrong?

Answer 1: Use transform. groupby("id")["cars"].mean() is indexed by the id values, so assigning it back aligns those labels against the row index 0..5 and produces NaN everywhere; transform returns a Series aligned to the original index:

df["result"] = df.groupby("id")["cars"].transform("mean")
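A self-contained demo of the transform fix (the answer above is completed from its truncated "Use ..." in the source):

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [2012, 2013, 2014, 2012, 2013, 2014],
                   'id': [1, 1, 1, 2, 2, 2],
                   'cars': [4, 6, np.nan, 10, 20, np.nan]})

# transform('mean') is aligned to df's original index, and mean
# skips NaN by default, so every row of an id gets its group mean
df['result'] = df.groupby('id')['cars'].transform('mean')
print(df)   # result: 5 for id 1, 15 for id 2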

How can I ignore empty series when using value_counts on a Pandas groupby?

百般思念 submitted on 2019-12-10 21:49:14
Question: I've got a DataFrame with the metadata for one newspaper article per row. I'd like to group the articles into monthly chunks, then count the values of one column (called type):

monthly_articles = articles.groupby(pd.Grouper(freq="M"))
monthly_articles = monthly_articles["type"].value_counts().unstack()

This works fine with an annual grouper but fails when I group by month:

ValueError: operands could not be broadcast together with shape (141,) (139,)

I think this is because some months contain no articles at all.
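One commonly suggested workaround is to move type into the groupby and count with size() instead of value_counts(). A sketch on hypothetical data (the column name type is from the question; the dates and values are assumed):

import pandas as pd

# Hypothetical frame: datetime index, one 'type' per article;
# February has no articles at all
articles = pd.DataFrame(
    {'type': ['news', 'opinion', 'news', 'news']},
    index=pd.to_datetime(['2017-01-05', '2017-01-20',
                          '2017-03-02', '2017-03-15']))

# Count rows per (month, type); size() sidesteps unstacking the
# per-group value_counts that trips over empty months
counts = (articles.groupby([pd.Grouper(freq='M'), 'type'])
                  .size()
                  .unstack(fill_value=0))
print(counts)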

ValueError: Grouper and axis must be same length

旧巷老猫 submitted on 2019-12-10 20:05:28
Question: I have a dataframe with 38 columns, one of them time. I set up a bin space:

timeframe = ['4-6', '7-9', '10-12', '13-15', '16-18', '19-21', '22-24']
bins = [3, 6, 9, 12, 15, 18, 21, 24]

Now I cut:

frameddata = pd.cut(df['time'], bins, retbins=True, labels=timeframe)

and want to group the df by those bins:

groups = df.groupby(frameddata)

Here I get the following error:

ValueError: Grouper and axis must be same length

Any help on this?

Answer 1: I believe you need to create a new column. With retbins=True, pd.cut returns a 2-tuple (the binned categorical plus the bin edges), which groupby cannot align against the dataframe; drop retbins and assign the categorical to a column:

df['bins'] = pd.cut(df['time'], bins, labels=timeframe)
groups = df.groupby('bins')
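An end-to-end sketch of that fix (the answer code is completed from its truncated pd.cut call in the source; the times below are made up):

import pandas as pd

df = pd.DataFrame({'time': [5, 8, 11, 14, 17, 20, 23]})  # hypothetical

timeframe = ['4-6', '7-9', '10-12', '13-15', '16-18', '19-21', '22-24']
bins = [3, 6, 9, 12, 15, 18, 21, 24]

# Without retbins=True, pd.cut returns a single Categorical the same
# length as df, which is a valid groupby key
df['bins'] = pd.cut(df['time'], bins, labels=timeframe)
print(df.groupby('bins').size())   # one row lands in each labeled bin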