pandas-groupby

Best Way to add group totals to a dataframe in Pandas

I have a simple task that I'm wondering if there is a better / more efficient way to do. I have a dataframe that looks like this:

```
  Group  Score  Count
0     A      5    100
1     A      1     50
2     A      3      5
3     B      1     40
4     B      2     20
5     B      1     60
```

And I want to add a column that holds the value of the group's total count:

```
  Group  Score  Count  TotalCount
0     A      5    100         155
1     A      1     50         155
2     A      3      5         155
3     B      1     40         120
4     B      2     20         120
5     B      1     60         120
```

The way I did this was:

```python
Grouped = df.groupby('Group')['Count'].sum().reset_index()
Grouped = Grouped.rename(columns={'Count': 'TotalCount'})
df = pd.merge(df, Grouped, on='Group', how='left')
```

Is there a better / cleaner way?
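
One cleaner alternative is `groupby().transform('sum')`, which returns a result aligned to the original index, so each group's total lands on every row without a merge. A minimal sketch, re-creating the dataframe above:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Score': [5, 1, 3, 1, 2, 1],
                   'Count': [100, 50, 5, 40, 20, 60]})

# transform broadcasts each group's sum back onto the original rows,
# so no reset_index / rename / merge round-trip is needed
df['TotalCount'] = df.groupby('Group')['Count'].transform('sum')
```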

Select the max row per group - pandas performance issue

I'm selecting one max row per group, using groupby / agg to return index values and selecting the rows with loc. For example, to group by "Id" and then select the row with the highest "delta" value:

```python
selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]
```

However, it's very slow this way. In fact, my i7 / 16 GB RAM laptop hangs when I run this query on 13 million rows. I have two questions for the experts:

1. How can I make this query run fast in pandas?
2. What am I doing wrong? Why is this operation so expensive?

[Update] Thank you so much for
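
As for why it is expensive: `groupby.apply` invokes the Python lambda once per group, and with millions of groups that per-call overhead dominates. Vectorized alternatives avoid it; a sketch, assuming `Id` and `delta` columns as in the question (the toy frame stands in for the real data):

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2, 2, 2],
                   'delta': [0.5, 0.9, 0.1, 0.7, 0.3]})

# option 1: idxmax is computed per group in vectorized code and
# returns the index label of each group's maximum delta
selected_rows = df.loc[df.groupby('Id')['delta'].idxmax()]

# option 2: sort once, then keep the last (largest-delta) row per Id
selected_rows = (df.sort_values('delta')
                   .drop_duplicates('Id', keep='last'))
```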

How to get number of groups in a groupby object in pandas?

Question: This would be useful so I know how many unique groups I have to perform calculations on. Thank you. Suppose the groupby object is called dfgroup.

Answer 1: As documented, you can get the number of groups with len(dfgroup).

Answer 2: As of v0.23, there are multiple options to use. First, the setup:

```python
df = pd.DataFrame({'A': list('aabbcccd'), 'B': 'x'})
df
```

```
   A  B
0  a  x
1  a  x
2  b  x
3  b  x
4  c  x
5  c  x
6  c  x
7  d  x
```

```python
g = df.groupby(['A'])
```

1) ngroups: newer versions of the groupby API provide this (undocumented) attribute.
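
A quick sketch checking these options side by side (`df['A'].nunique()` is an additional route that skips building the groupby entirely):

```python
import pandas as pd

df = pd.DataFrame({'A': list('aabbcccd'), 'B': 'x'})
g = df.groupby(['A'])

print(len(g))             # 4 -- documented length of the groupby
print(g.ngroups)          # 4 -- attribute on the groupby object
print(df['A'].nunique())  # 4 -- distinct keys, no groupby needed
```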

Pandas - dataframe groupby - how to get sum of multiple columns

This should be an easy one, but somehow I couldn't find a solution that works. I have a pandas dataframe which looks like this:

```
index col1 col2 col3 col4 col5
0     a    c    1    2    f
1     a    c    1    2    f
2     a    d    1    2    f
3     b    d    1    2    g
4     b    e    1    2    g
5     b    e    1    2    g
```

I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped, since the data cannot be aggregated. Here is how the output should look. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter whether col1 and col2 are part of the index or not.

```
index col1 col2 col3 col4
0     a    c    2    4
1     a    d    1    2
2     b    d    1    2
3     b    e    2    4
```
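
The standard approach is to pass both keys to groupby and select only the columns to be summed, which drops col5 automatically; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col2': ['c', 'c', 'd', 'd', 'e', 'e'],
                   'col3': [1, 1, 1, 1, 1, 1],
                   'col4': [2, 2, 2, 2, 2, 2],
                   'col5': ['f', 'f', 'f', 'g', 'g', 'g']})

# selecting ['col3', 'col4'] restricts the aggregation to those columns;
# as_index=False keeps col1 / col2 as ordinary columns instead of an index
out = df.groupby(['col1', 'col2'], as_index=False)[['col3', 'col4']].sum()
```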

Regression by group in python pandas

Question: I want to ask a quick question related to regression analysis in python pandas. So, assume that I have the following dataset:

```
Group   Y  X
1      10  6
1       5  4
1       3  1
2       4  6
2       2  4
2       3  9
```

My aim is to run a regression: Y is the dependent and X is the independent variable. The issue is that I want to run this regression by Group and print the coefficients in a new dataset. So, the results should look like:

```
Group  Coefficient
1      0.25   (let's assume that the coefficient is 0.25)
2      0.30
```

I hope I have explained my question. Many thanks.
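
One way to do this, sketched below with numpy.polyfit for the single-regressor slope (statsmodels or scipy.stats.linregress would give full regression output), is to iterate over the groups and collect one coefficient per key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 2],
                   'Y': [10, 5, 3, 4, 2, 3],
                   'X': [6, 4, 1, 6, 4, 9]})

# one slope per group: a degree-1 polyfit returns [slope, intercept]
coefs = pd.DataFrame(
    [(key, np.polyfit(grp['X'], grp['Y'], 1)[0])
     for key, grp in df.groupby('Group')],
    columns=['Group', 'Coefficient'])
```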

pandas groupby where you get the max of one column and the min of another column

I have a dataframe as follows:

```
user num1 num2
a    1    1
a    2    2
a    3    3
b    4    4
b    5    5
```

I want a dataframe which has the minimum of num1 for each user, and the maximum of num2 for each user. The output should look like:

```
user num1 num2
a    1    3
b    4    5
```

I know that if I wanted the max of both columns I could just do:

```python
a.groupby('user')['num1', 'num2'].max()
```

Is there some equivalent without having to do something like:

```python
series_1 = a.groupby('user')['num1'].min()
series_2 = a.groupby('user')['num2'].max()
# converting from series to df so I can do a join on user
df_1 = pd.DataFrame(np.array([series_1]).transpose(),
```
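
`agg` accepts a per-column mapping, so a single call can apply a different statistic to each column; a minimal sketch with the data above:

```python
import pandas as pd

a = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b'],
                  'num1': [1, 2, 3, 4, 5],
                  'num2': [1, 2, 3, 4, 5]})

# one groupby, a different aggregation per column, no manual join
out = a.groupby('user').agg({'num1': 'min', 'num2': 'max'}).reset_index()
```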

Python Pandas Group by date using datetime data

Question: I have a column Date_Time that I wish to group by date without creating a new column. Is this possible? The current code I have does not work:

```python
df = pd.groupby(df, by=[df['Date_Time'].date()])
```

Answer 1:

resample

```python
df.resample('D', on='Date_Time').mean()
```

```
              B
Date_Time
2001-10-01  4.5
2001-10-02  6.0
```

Grouper

As suggested by @JosephCottam:

```python
df.set_index('Date_Time').groupby(pd.Grouper(freq='D')).mean()
```

```
              B
Date_Time
2001-10-01  4.5
2001-10-02  6.0
```

Deprecated uses of TimeGrouper

You can set the index to be 'Date_Time'
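
pd.Grouper also takes a key= argument pointing at a column, which satisfies the "without creating a new column" constraint without touching the index; a sketch using made-up sample values consistent with the output shown above:

```python
import pandas as pd

df = pd.DataFrame({'Date_Time': pd.to_datetime(['2001-10-01 09:00',
                                                '2001-10-01 17:00',
                                                '2001-10-02 12:00']),
                   'B': [4.0, 5.0, 6.0]})

# key= points Grouper at a column instead of the index,
# so neither set_index nor an extra date column is required
daily = df.groupby(pd.Grouper(key='Date_Time', freq='D'))['B'].mean()
```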

Python - rolling functions for GroupBy object

Question: I have a time series object grouped of the type <pandas.core.groupby.SeriesGroupBy object at 0x03F1A9F0>. grouped.sum() gives the desired result, but I cannot get rolling_sum to work with the groupby object. Is there any way to apply rolling functions to groupby objects? For example:

```python
x = range(0, 6)
id = ['a', 'a', 'a', 'b', 'b', 'b']
df = DataFrame(zip(id, x), columns=['id', 'x'])
df.groupby('id').sum()
```

```
     x
id
a    3
b   12
```

However, I would like to have something like:

```
   id  x
0   a  0
1   a  1
2   a  3
3   b
```
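
rolling_sum was removed from pandas long ago; in current pandas, .rolling() is available directly on a groupby object. A sketch (window=2 with min_periods=1 reproduces the 0, 1, 3 pattern in the desired output):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x': range(6)})

# rolling window of 2 within each group; min_periods=1 keeps the
# first row of each group instead of producing NaN
rolled = (df.groupby('id')['x']
            .rolling(window=2, min_periods=1)
            .sum()
            .reset_index(level=0, drop=True))  # back to the original index
```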

How to drop duplicates based on two or more subsets criteria in Pandas data-frame

Let's say this is my data-frame:

```python
df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})
```

It looks like this ...

```
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
```

I want to drop row 1 because it has the same bio & center as row 0. I want to keep row 2 because it has the same bio but a different center than row 0. Something like this won't work given drop_duplicates' input structure, but it's what I am trying to do:

```python
df.drop_duplicates(subset='bio' & subset='center')
```

Any suggestions?

edit: changed df a bit to fit
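
drop_duplicates takes a list of column names in subset, so both criteria go in one call; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

# rows count as duplicates only when *both* bio and center match;
# keep='first' retains row 0 and drops row 1, while row 2 survives
deduped = df.drop_duplicates(subset=['bio', 'center'], keep='first')
```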

Python Pandas Conditional Sum with Groupby

Using sample data:

```python
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
```

```
      data1     data2 key1 key2
0  0.361601  0.375297    a  one
1  0.069889  0.809772    a  two
2  1.468194  0.272929    b  one
3 -1.138458  0.865060    b  two
4 -0.268210  1.250340    a  one
```

I'm trying to figure out how to group the data by key1 and sum only the data1 values where key2 equals 'one'. Here's what I've tried:

```python
def f(d, a, b):
    d.ix[d[a] == b, 'data1'].sum()

df.groupby(['key1']).apply(f, a='key2', b='one').reset_index()
```

But this gives me a
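
Two things stand out in the attempt: f never returns its sum (so apply yields None for every group), and .ix has since been removed from pandas. A sketch of one apply-free alternative: mask data1 to zero wherever key2 is not 'one', then take an ordinary groupby sum:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# where() keeps data1 when key2 == 'one' and substitutes 0 elsewhere,
# so the plain per-key1 sum becomes the conditional sum
out = (df.assign(data1=df['data1'].where(df['key2'] == 'one', 0))
         .groupby('key1')['data1']
         .sum()
         .reset_index())
```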