pandas-groupby

Best Way to add group totals to a dataframe in Pandas

I have a simple task that I'm wondering if there is a better / more efficient way to do. I have a dataframe that looks like this:

```
  Group  Score  Count
0     A      5    100
1     A      1     50
2     A      3      5
3     B      1     40
4     B      2     20
5     B      1     60
```

And I want to add a column that holds the value of the group's total count:

```
  Group  Score  Count  TotalCount
0     A      5    100         155
1     A      1     50         155
2     A      3      5         155
3     B      1     40         120
4     B      2     20         120
5     B      1     60         120
```

The way I did this was:

```python
Grouped = df.groupby('Group')['Count'].sum().reset_index()
Grouped = Grouped.rename(columns={'Count': 'TotalCount'})
df = pd.merge(df, Grouped, on='Group', how='left')
```

Is there a better / cleaner way?
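
One cleaner alternative is `groupby().transform('sum')`, which returns a result aligned to the original index, so each group's total lands on every row without a merge. A minimal sketch, re-creating the dataframe above:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Score': [5, 1, 3, 1, 2, 1],
                   'Count': [100, 50, 5, 40, 20, 60]})

# transform broadcasts each group's sum back onto the original rows,
# so no reset_index / rename / merge round-trip is needed
df['TotalCount'] = df.groupby('Group')['Count'].transform('sum')
```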

Select the max row per group - pandas performance issue

I'm selecting one max row per group, using groupby / agg to return index values and selecting the rows with loc. For example, to group by "Id" and then select the row with the highest "delta" value:

```python
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]
```

However, it's very slow this way. In fact, my i7 / 16 GB RAM laptop hangs when I run this query on 13 million rows. I have two questions for the experts:

1. How can I make this query run fast in pandas?
2. What am I doing wrong? Why is this operation so expensive?

[Update] Thank you so much for
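
As for why it is expensive: `groupby.apply` invokes the Python lambda once per group, and with millions of groups that per-call overhead dominates. Vectorized alternatives avoid it; a sketch, assuming `Id` and `delta` columns as in the question (the toy frame stands in for the real data):

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 2, 2],
                   'delta': [0.5, 0.9, 0.1, 0.7, 0.3]})

# option 1: idxmax is computed per group in vectorized code and
# returns the index label of each group's maximum delta
selected_rows = df.loc[df.groupby('Id')['delta'].idxmax()]

# option 2: sort once, then keep the last (largest-delta) row per Id
selected_rows = (df.sort_values('delta')
                   .drop_duplicates('Id', keep='last'))
```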

How to get number of groups in a groupby object in pandas?

Question: This would be useful so I know how many unique groups I have to perform calculations on. Thank you. Suppose the groupby object is called dfgroup.

Answer 1: As documented, you can get the number of groups with len(dfgroup).

Answer 2: As of v0.23, there are multiple options to use. First, the setup:

```python
df = pd.DataFrame({'A': list('aabbcccd'), 'B': 'x'})
df
```

```
   A  B
0  a  x
1  a  x
2  b  x
3  b  x
4  c  x
5  c  x
6  c  x
7  d  x
```

```python
g = df.groupby(['A'])
```

1) ngroups: newer versions of the groupby API provide this (undocumented) attribute.
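
A quick sketch checking these options side by side (`df['A'].nunique()` is an additional route that skips building the groupby entirely):

```python
import pandas as pd

df = pd.DataFrame({'A': list('aabbcccd'), 'B': 'x'})
g = df.groupby(['A'])

print(len(g))             # 4 -- documented length of the groupby
print(g.ngroups)          # 4 -- attribute on the groupby object
print(df['A'].nunique())  # 4 -- distinct keys, no groupby needed
```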

Pandas - dataframe groupby - how to get sum of multiple columns

This should be an easy one, but somehow I couldn't find a solution that works. I have a pandas dataframe which looks like this:

```
index col1 col2 col3 col4 col5
0     a    c    1    2    f
1     a    c    1    2    f
2     a    d    1    2    f
3     b    d    1    2    g
4     b    e    1    2    g
5     b    e    1    2    g
```

I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped, since the data cannot be aggregated. Here is how the output should look. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter whether col1 and col2 are part of the index or not.

```
index col1 col2 col3 col4
0     a    c    2    4
1     a    d    1    2
2     b    d    1    2
3     b    e    2    4
```
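
The standard approach is to pass both keys to groupby and select only the columns to be summed, which drops col5 automatically; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col2': ['c', 'c', 'd', 'd', 'e', 'e'],
                   'col3': [1, 1, 1, 1, 1, 1],
                   'col4': [2, 2, 2, 2, 2, 2],
                   'col5': ['f', 'f', 'f', 'g', 'g', 'g']})

# selecting ['col3', 'col4'] restricts the aggregation to those columns;
# as_index=False keeps col1 / col2 as ordinary columns instead of an index
out = df.groupby(['col1', 'col2'], as_index=False)[['col3', 'col4']].sum()
```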

Regression by group in python pandas

Question: I want to ask a quick question related to regression analysis in python pandas. So, assume that I have the following dataset:

```
Group   Y  X
1      10  6
1       5  4
1       3  1
2       4  6
2       2  4
2       3  9
```

My aim is to run a regression: Y is the dependent and X is the independent variable. The issue is that I want to run this regression by Group and print the coefficients in a new dataset. So, the results should look like:

```
Group  Coefficient
1      0.25   (let's assume that the coefficient is 0.25)
2      0.30
```

I hope I have explained my question. Many thanks.
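
One way to do this, sketched below with numpy.polyfit for the single-regressor slope (statsmodels or scipy.stats.linregress would give full regression output), is to iterate over the groups and collect one coefficient per key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 2],
                   'Y': [10, 5, 3, 4, 2, 3],
                   'X': [6, 4, 1, 6, 4, 9]})

# one slope per group: a degree-1 polyfit returns [slope, intercept]
coefs = pd.DataFrame(
    [(key, np.polyfit(grp['X'], grp['Y'], 1)[0])
     for key, grp in df.groupby('Group')],
    columns=['Group', 'Coefficient'])
```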

pandas groupby where you get the max of one column and the min of another column

I have a dataframe as follows:

```
user num1 num2
a    1    1
a    2    2
a    3    3
b    4    4
b    5    5
```

I want a dataframe which has the minimum of num1 for each user, and the maximum of num2 for each user. The output should look like:

```
user num1 num2
a    1    3
b    4    5
```

I know that if I wanted the max of both columns I could just do:

```python
a.groupby('user')['num1', 'num2'].max()
```

Is there some equivalent without having to do something like:

```python
series_1 = a.groupby('user')['num1'].min()
series_2 = a.groupby('user')['num2'].max()
# converting from series to df so I can do a join on user
df_1 = pd.DataFrame(np.array([series_1]).transpose(),
```
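
`agg` accepts a per-column mapping, so a single call can apply a different statistic to each column; a minimal sketch with the data above:

```python
import pandas as pd

a = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b'],
                  'num1': [1, 2, 3, 4, 5],
                  'num2': [1, 2, 3, 4, 5]})

# one groupby, a different aggregation per column, no manual join
out = a.groupby('user').agg({'num1': 'min', 'num2': 'max'}).reset_index()
```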

Python Pandas Group by date using datetime data

Question: I have a column Date_Time that I wish to group by date without creating a new column. Is this possible? The current code I have does not work:

```python
df = pd.groupby(df, by=[df['Date_Time'].date()])
```

Answer 1:

resample

```python
df.resample('D', on='Date_Time').mean()
```

```
              B
Date_Time
2001-10-01  4.5
2001-10-02  6.0
```

Grouper

As suggested by @JosephCottam:

```python
df.set_index('Date_Time').groupby(pd.Grouper(freq='D')).mean()
```

```
              B
Date_Time
2001-10-01  4.5
2001-10-02  6.0
```

Deprecated uses of TimeGrouper

You can set the index to be 'Date_Time'
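
pd.Grouper also takes a key= argument pointing at a column, which satisfies the "without creating a new column" constraint without touching the index; a sketch using made-up sample values consistent with the output shown above:

```python
import pandas as pd

df = pd.DataFrame({'Date_Time': pd.to_datetime(['2001-10-01 09:00',
                                                '2001-10-01 17:00',
                                                '2001-10-02 12:00']),
                   'B': [4.0, 5.0, 6.0]})

# key= points Grouper at a column instead of the index,
# so neither set_index nor an extra date column is required
daily = df.groupby(pd.Grouper(key='Date_Time', freq='D'))['B'].mean()
```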

Python - rolling functions for GroupBy object

Question: I have a time series object grouped of the type <pandas.core.groupby.SeriesGroupBy object at 0x03F1A9F0>. grouped.sum() gives the desired result, but I cannot get rolling_sum to work with the groupby object. Is there any way to apply rolling functions to groupby objects? For example:

```python
x = range(0, 6)
id = ['a', 'a', 'a', 'b', 'b', 'b']
df = DataFrame(zip(id, x), columns=['id', 'x'])
df.groupby('id').sum()
```

```
     x
id
a    3
b   12
```

However, I would like to have something like:

```
   id  x
0   a  0
1   a  1
2   a  3
3   b
```
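
rolling_sum was removed from pandas long ago; in current pandas, .rolling() is available directly on a groupby object. A sketch (window=2 with min_periods=1 reproduces the 0, 1, 3 pattern in the desired output):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x': range(6)})

# rolling window of 2 within each group; min_periods=1 keeps the
# first row of each group instead of producing NaN
rolled = (df.groupby('id')['x']
            .rolling(window=2, min_periods=1)
            .sum()
            .reset_index(level=0, drop=True))  # back to the original index
```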

How to drop duplicates based on two or more subsets criteria in Pandas data-frame

Let's say this is my data-frame:

```python
df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})
```

It looks like this ...

```
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
```

I want to drop row 1 because it has the same bio & center as row 0. I want to keep row 2 because it has the same bio but a different center than row 0. Something like this won't work given drop_duplicates' input structure, but it's what I am trying to do:

```python
df.drop_duplicates(subset='bio' & subset='center')
```

Any suggestions?

edit: changed df a bit to fit
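
drop_duplicates takes a list of column names in subset, so both criteria go in one call; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

# rows count as duplicates only when *both* bio and center match;
# keep='first' retains row 0 and drops row 1, while row 2 survives
deduped = df.drop_duplicates(subset=['bio', 'center'], keep='first')
```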

Python Pandas Conditional Sum with Groupby

Using sample data:

```python
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
```

```
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
1  0.069889  0.809772    a  two
2  1.468194  0.272929    b  one
3 -1.138458  0.865060    b  two
4 -0.268210  1.250340    a  one
```

I'm trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals 'one'. Here's what I've tried:

```python
def f(d, a, b):
    d.ix[d[a] == b, 'data1'].sum()

df.groupby(['key1']).apply(f, a='key2', b='one').reset_index()
```

But this gives me a
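
Two things stand out in the attempt: f never returns its sum (so apply yields None for every group), and .ix has since been removed from pandas. A sketch of one apply-free alternative: mask data1 to zero wherever key2 is not 'one', then take an ordinary groupby sum:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# where() keeps data1 when key2 == 'one' and substitutes 0 elsewhere,
# so the plain per-key1 sum becomes the conditional sum
out = (df.assign(data1=df['data1'].where(df['key2'] == 'one', 0))
         .groupby('key1')['data1']
         .sum()
         .reset_index())
```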