pandas-groupby

Pandas: Sum multiple columns and get results in multiple columns

Submitted by 陌路散爱 on 2019-11-30 20:13:19
Question: I have a "sample.txt" like this:

    idx  A  B  C  D  cat
    J    1  2  3  1  x
    K    4  5  6  2  x
    L    7  8  9  3  y
    M    1  2  3  4  y
    N    4  5  6  5  z
    O    7  8  9  6  z

With this dataset, I want to get sums by row and by column. The row direction is not a big deal. I made this result:

    ### MY CODE ###
    import pandas as pd

    df = pd.read_csv('sample.txt', sep="\t", index_col='idx')
    df.info()
    df2 = df.groupby('cat').sum()
    print(df2)

The result looks like this:

          A   B   C   D
    cat
    x     5   7   9   3
    y     8  10  12   7
    z    11  13  15  11

But I don't know how to write a code to get ...
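A minimal sketch of one way to get sums in the other direction as well, assuming the question (which is cut off above) wants a row-wise total across A-D added to the grouped result; the column name 'total' is made up here:

    import pandas as pd

    df = pd.read_csv('sample.txt', sep='\t', index_col='idx')

    # group-wise sums, as in the question
    df2 = df.groupby('cat').sum()

    # row-wise total across the value columns, added as a new column
    df2['total'] = df2[['A', 'B', 'C', 'D']].sum(axis=1)
    print(df2)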

How to calculate vwap (volume weighted average price) using groupby and apply?

Submitted by 拥有回忆 on 2019-11-30 15:46:13
I have read multiple posts similar to my question, but I still can't figure it out. I have a pandas df that looks like the following (for multiple days):

    Out[1]:
                          price  quantity
    time
    2016-06-08 09:00:22   32.30    1960.0
    2016-06-08 09:00:22   32.30     142.0
    2016-06-08 09:00:22   32.30    3857.0
    2016-06-08 09:00:22   32.30    1000.0
    2016-06-08 09:00:22   32.35     991.0
    2016-06-08 09:00:22   32.30     447.0
    ...

To calculate the vwap I could do:

    df['vwap'] = (np.cumsum(df.quantity * df.price) / np.cumsum(df.quantity))

However, I would like to start over every day (groupby), but I can't figure out how to make it work with a ...
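A sketch of one way to restart the running VWAP each day: group on the date part of the DatetimeIndex and compute the cumulative ratio inside each group. The helper name vwap and the use of group_keys=False are choices made here, not part of the question:

    import pandas as pd

    def vwap(day):
        # cumulative volume-weighted average price within one day
        return (day['price'] * day['quantity']).cumsum() / day['quantity'].cumsum()

    df['vwap'] = df.groupby(df.index.date, group_keys=False).apply(vwap)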

Speed up Pandas cummin/cummax

Submitted by 微笑、不失礼 on 2019-11-30 13:49:40
Pandas cummin and cummax functions seem to be really slow for my use case with many groups. How can I speed them up?

Update:

    import pandas as pd
    import numpy as np
    from collections import defaultdict

    def cummax(g, v):
        df1 = pd.DataFrame(g, columns=['group'])
        df2 = pd.DataFrame(v)
        df = pd.concat([df1, df2], axis=1)
        result = df.groupby('group').cummax()
        result = result.values
        return result

    def transform(g, v):
        df1 = pd.DataFrame(g, columns=['group'])
        df2 = pd.DataFrame(v)
        df = pd.concat([df1, df2], axis=1)
        result = df.groupby('group').transform(lambda x: x.cummax())
        result = result.values
        return result
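A sketch of one common way to speed this up when there are many groups: encode the groups as integer codes and keep a per-group running maximum in a single compiled pass. This assumes the optional numba package is available; the function and column names ('group', 'value') are illustrative:

    import numpy as np
    import pandas as pd
    from numba import njit

    @njit
    def group_cummax(codes, values, n_groups):
        # single pass: track the running max of each group by its integer code
        running = np.full(n_groups, -np.inf)
        out = np.empty_like(values)
        for i in range(values.shape[0]):
            k = codes[i]
            if values[i] > running[k]:
                running[k] = values[i]
            out[i] = running[k]
        return out

    codes, uniques = pd.factorize(df['group'])
    df['cummax'] = group_cummax(codes, df['value'].to_numpy(dtype=np.float64), len(uniques))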

Pandas groupby multiple columns, list of multiple columns

Submitted by 半世苍凉 on 2019-11-30 10:29:31
I have the following data:

    Invoice  NoStockCode  Description                          Quantity  CustomerID  Country
    536365   85123A       WHITE HANGING HEART T-LIGHT HOLDER   6         17850       United Kingdom
    536365   71053        WHITE METAL LANTERN                  6         17850       United Kingdom
    536365   84406B       CREAM CUPID HEARTS COAT HANGER       8         17850       United Kingdom

I am trying to do a groupby, so I have the following operation:

    df.groupby(['InvoiceNo','CustomerID','Country'])['NoStockCode','Description','Quantity'].apply(list)

I want to get this output:

    | Invoice | CustomerID | Country        | NoStockCode           | Description | Quantity
    | 536365  | 17850      | United Kingdom | 85123A, 71053, 84406B | WHITE ...
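A sketch of one way to get that shape: aggregate the three value columns per group and join each into a comma-separated string (agg(list) would keep Python lists instead). The column names follow the question's own code:

    out = (df.groupby(['InvoiceNo', 'CustomerID', 'Country'])
             [['NoStockCode', 'Description', 'Quantity']]
             .agg(lambda s: ', '.join(map(str, s)))
             .reset_index())
    print(out)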

Create new columns from aggregated categories

Submitted by 亡梦爱人 on 2019-11-30 09:47:50
Question: I have a dataframe that looks like:

       SK_ID_CURR  CREDIT_ACTIVE
    0      215354  Closed
    1      215354  Active
    2      215354  Active
    3      215354  Active
    4      215354  Active
    5      215354  Active
    6      215354  Active
    7      162297  Closed
    8      162297  Closed
    9      162297  Active

I would like to count the active and closed credits for each id, and then make new columns, Active_credits and Closed_credits, holding the number of corresponding active and closed credits for each id.

Answer 1: You can use pandas.crosstab, which avoids your suggested ...
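A minimal sketch of the pandas.crosstab approach the answer begins to describe: the cross-tabulation counts each CREDIT_ACTIVE value per id, and the renaming to the requested Active_credits/Closed_credits names is added here:

    import pandas as pd

    counts = pd.crosstab(df['SK_ID_CURR'], df['CREDIT_ACTIVE'])
    counts = (counts.rename(columns={'Active': 'Active_credits',
                                     'Closed': 'Closed_credits'})
                    .reset_index())
    print(counts)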

Pandas groupby to to_csv

Submitted by 微笑、不失礼 on 2019-11-30 08:37:08
Question: I want to output a pandas groupby dataframe to CSV. I tried various StackOverflow solutions but they have not worked. Python 3.6.1, pandas 0.20.1.

The groupby result looks like:

             id  month   year  count
    week
    0      9066     82  32142    895
    1      7679     84  30112    749
    2      8368    126  42187    872
    3     11038    102  34165    976
    4      8815    117  34122    767
    5     10979    163  50225   1252
    6      8726    142  38159    996
    7      5568     63  26143    582

I want a CSV that looks like:

    week  count
    0     895
    1     749
    2     872
    3     976
    4     767
    5     1252
    6     996
    7     582

Current code:

    week_grouped = df.groupby('week') ...
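Since groupby on its own returns a lazy GroupBy object rather than a frame, a sketch of what usually works is to aggregate first and then write the result. This assumes df still holds the raw rows with a 'week' column and that sum is the intended aggregation; the output filename is made up:

    # aggregate, keep only the count column, and write it out
    week_grouped = df.groupby('week', as_index=False)['count'].sum()
    week_grouped.to_csv('weekly_counts.csv', index=False)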

Use Pandas groupby() + apply() with arguments

Submitted by 懵懂的女人 on 2019-11-30 04:44:56
I would like to use df.groupby() in combination with apply() to apply a function to each row per group. I normally use the following code, which usually works (note that this is without groupby()):

    df.apply(myFunction, args=(arg1,))

With groupby() I tried the following:

    df.groupby('columnName').apply(myFunction, args=(arg1,))

However, I get the following error:

    TypeError: myFunction() got an unexpected keyword argument 'args'

Hence, my question is: how can I use groupby() and apply() with a function that needs arguments?

pandas.core.groupby.GroupBy.apply does NOT have a named parameter ...
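GroupBy.apply forwards extra positional and keyword arguments straight to the function, so the args= wrapper can simply be dropped. A runnable sketch with made-up data and a hypothetical myFunction:

    import pandas as pd

    df = pd.DataFrame({'columnName': ['a', 'a', 'b'], 'value': [1, 2, 3]})

    def myFunction(group, arg1):
        # hypothetical example: scale each group's sum by arg1
        return group['value'].sum() * arg1

    result = df.groupby('columnName').apply(myFunction, 10)      # positional
    # or: df.groupby('columnName').apply(myFunction, arg1=10)    # keyword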

What is the pandas equivalent of dplyr summarize/aggregate by multiple functions?

Submitted by 柔情痞子 on 2019-11-29 19:53:09
I'm having issues transitioning to pandas from R, where the dplyr package can easily group by and perform multiple summarizations. Please help improve my existing Python pandas code for multiple aggregations:

    import pandas as pd

    data = pd.DataFrame(
        {'col1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
         'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
         'col3': [-1, -2, -3, -4, -5, -6, -7, -8, -9, 0]})

    result = []
    for k, v in data.groupby('col1'):
        result.append([k, max(v['col2']), min(v['col3'])])
    print(pd.DataFrame(result, columns=['col1', 'col2_agg', 'col3_agg']))

Issues: too verbose, and it can probably be optimized for efficiency. (I rewrote a for-loop ...
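A sketch of the usual pandas one-liner for this kind of dplyr-style summarize, using named aggregation (note this syntax needs a newer pandas, 0.25 or later, than the Python 2-era loop above suggests):

    result = data.groupby('col1', as_index=False).agg(
        col2_agg=('col2', 'max'),
        col3_agg=('col3', 'min'),
    )
    print(result)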