pandas-groupby

Speed up Pandas cummin/cummax

Question: The Pandas cummin and cummax functions seem to be really slow for my use case with many groups. How can I speed them up?

Update:

```python
import pandas as pd
import numpy as np
from collections import defaultdict

def cummax(g, v):
    df1 = pd.DataFrame(g, columns=['group'])
    df2 = pd.DataFrame(v)
    df = pd.concat([df1, df2], axis=1)
    result = df.groupby('group').cummax()
    result = result.values
    return result

def transform(g, v):
    df1 = pd.DataFrame(g, columns=['group'])
    df2 = pd.DataFrame(v)
    df = pd.concat([df1, …
```
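The excerpt cuts off before any answer, but one common approach to this kind of slowdown is to replace the per-group DataFrame machinery with a single pass over dense integer group codes, JIT-compiled with numba. This is a sketch, not the accepted answer: it assumes numba is available and that the values are a 1-D numeric array; `fast_cummax` and `_group_cummax` are illustrative names, not pandas API.

```python
import numpy as np
import pandas as pd
from numba import njit

@njit
def _group_cummax(codes, values, n_groups):
    # One running maximum per group, updated in a single pass.
    # Assumes values is a 1-D float64 array.
    running = np.full(n_groups, -np.inf)
    out = np.empty(len(values))
    for i in range(len(values)):
        c = codes[i]
        if values[i] > running[c]:
            running[c] = values[i]
        out[i] = running[c]
    return out

def fast_cummax(g, v):
    # Dense integer codes avoid hashing group labels inside the hot loop.
    codes, uniques = pd.factorize(g)
    return _group_cummax(codes, np.asarray(v, dtype=np.float64), len(uniques))
```

The cummin variant is symmetric: initialize the running array with `np.inf` and flip the comparison.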

Compare preceding two rows with subsequent two rows of each group till last record

I had a question earlier which was deleted, and I have now modified it into a less verbose form so it is easier to read. I have a dataframe as given below:

```python
import pandas as pd

df = pd.DataFrame({'subject_id': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                                  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],
                   'day': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
                           1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                   'PEEP': [7,5,10,10,11,11,14,14,17,17,21,21,23,23,25,25,22,20,26,26,
                            5,7,8,8,9,9,13,13,15,15,12,12,15,15,19,19,19,22,22,15]})
df['fake_flag'] = ''
```

I would like to fill in values in the fake_flag column based on rules of the form: 1) if the preceding …
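The rule list is truncated above, so the flag condition below is only a placeholder; the transferable part is how `groupby().shift()` lines up the two preceding and the two subsequent PEEP readings for every row without crossing subject_id boundaries.

```python
import pandas as pd

g = df.groupby('subject_id')['PEEP']
prev1, prev2 = g.shift(1), g.shift(2)    # preceding two rows (NaN near group start)
next1, next2 = g.shift(-1), g.shift(-2)  # subsequent two rows (NaN near group end)

# Placeholder rule: flag a row when both subsequent readings exceed both
# preceding ones. Substitute the actual condition from the full question.
cond = (next1 > prev1) & (next1 > prev2) & (next2 > prev1) & (next2 > prev2)
df['fake_flag'] = cond.map({True: 'fake_flag', False: ''})
```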

What is the difference between groupby.first, groupby.nth, and groupby.head when as_index=False?

Edit: the rookie mistake I made with the string np.nan has been pointed out by @coldspeed, @wen-ben, @ALollz. The answers are quite good, so I am not deleting this question.

Original: I have read this question/answer: What's the difference between groupby.first() and groupby.head(1)? That answer explained that the differences lie in how NaN values are handled. However, when I call groupby with as_index=False, they both pick up NaN fine. Furthermore, Pandas has groupby.nth with functionality similar to head and first. What are the differences between groupby.first(), groupby.nth(0), and groupby.head(1) with as_index=False?
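A small reproduction of the usual distinction (my summary, under the assumption that as_index only changes the shape of the result, not the NaN handling): first() takes the first non-null value per column, nth(0) is purely positional, and head(1) simply returns the first row of each group with its original index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                   'B': [np.nan, 1.0, 3.0, np.nan]})
g = df.groupby('A')

print(g.first())   # first non-null B per group: x -> 1.0, y -> 3.0
print(g.nth(0))    # the literal first row per group: x -> NaN, y -> 3.0
print(g.head(1))   # first row of each group, original index preserved
```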

DataFrame: add column with the size of a group

Question: I have the following dataframe:

```
   fsq  digits digits_type
0    1       1         odd
1    2       1         odd
2    3       1         odd
3   11       2        even
4   22       2        even
5  101       3         odd
6  111       3         odd
```

and I want to add a last column, count, containing the number of fsq rows belonging to each digits group, i.e.:

```
   fsq  digits digits_type  count
0    1       1         odd      3
1    2       1         odd      3
2    3       1         odd      3
3   11       2        even      2
4   22       2        even      2
5  101       3         odd      2
6  111       3         odd      2
```

since there are 3 fsq rows that have digits equal to 1, 2 fsq rows that have digits equal to 2, and so on.

Answer 1:

```python
In [395]: df['count'] = df…
```
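The answer above is truncated, but the standard idiom it appears to be heading toward is transform, which broadcasts a per-group aggregate back to every row; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'fsq': [1, 2, 3, 11, 22, 101, 111],
                   'digits': [1, 1, 1, 2, 2, 3, 3],
                   'digits_type': ['odd', 'odd', 'odd', 'even', 'even', 'odd', 'odd']})

# 'size' counts the rows in each digits group and repeats that count per row.
df['count'] = df.groupby('digits')['fsq'].transform('size')
```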

How to get minimum of each group for each day based on hour criteria

I have given two dataframes below for you to test:

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 17:00:00','2173-04-03 20:00:00',
               '2173-04-04 11:00:00','2173-04-04 11:30:00','2173-04-04 12:00:00',
               '2173-04-05 16:00:00','2173-04-05 22:00:00','2173-04-06 04:00:00',
               '2173-04-06 04:30:00','2173-04-06 06:30:00'],
    'val': [5,5,5,10,5,10,5,8,3,8,10]
})
df1 = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-03 12:59:00',
               '2173-04-03 13:14:00','2173-04-03 13:37:00','2173-04-04 11:30:00', …
```
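The question is cut off before the hour criterion is stated, so the 06:00-12:59 window below is an assumption purely for illustration; the pattern is: filter by hour, then take the per-subject, per-day minimum.

```python
import pandas as pd

df['time_1'] = pd.to_datetime(df['time_1'])

# Hypothetical criterion: only consider readings between 06:00 and 12:59.
sub = df[df['time_1'].dt.hour.between(6, 12)].copy()
sub['day'] = sub['time_1'].dt.date

daily_min = sub.groupby(['subject_id', 'day'])['val'].min().reset_index()
```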

How to calculate vwap (volume weighted average price) using groupby and apply?

Question: I have read multiple posts similar to my question, but I still can't figure it out. I have a pandas df that looks like the following (for multiple days):

```
Out[1]:
                     price  quantity
time
2016-06-08 09:00:22  32.30    1960.0
2016-06-08 09:00:22  32.30     142.0
2016-06-08 09:00:22  32.30    3857.0
2016-06-08 09:00:22  32.30    1000.0
2016-06-08 09:00:22  32.35     991.0
2016-06-08 09:00:22  32.30     447.0
...
```

To calculate the vwap over the whole frame I could do:

```python
df['vwap'] = np.cumsum(df.quantity * df.price) / np.cumsum(df.quantity)
```

However, …
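To reset the running VWAP each day while keeping the cumulative formula above, one common pattern is a groupby over the calendar date with apply; a sketch assuming the index is a sorted DatetimeIndex:

```python
import pandas as pd

def vwap(group):
    # Cumulative dollar volume divided by cumulative traded volume.
    dollar = (group['price'] * group['quantity']).cumsum()
    volume = group['quantity'].cumsum()
    return dollar / volume

df = df.sort_index()
# .to_numpy() sidesteps index alignment, which is unsafe here because the
# timestamp index contains duplicates; row order is preserved by the groupby.
df['vwap'] = df.groupby(df.index.normalize(), group_keys=False).apply(vwap).to_numpy()
```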

Group data by season according to the exact dates

I have a CSV file containing 4 years of data, and I am trying to group the data by season over the 4 years; put differently, I need to summarize and plot my whole dataset into 4 seasons only. Here's a look at my data file:

```
timestamp,heure,lat,lon,impact,type
2006-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
2006-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
2007-02-01 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
2007-02-02 00:00:00,01:14:29,36.5685,0.9043,36.8,1
2008-01-01 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
2008-01-02 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
....
2011-12-31 00:00:00,05…
```
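One way to get exact-date seasons is to compare (month, day) tuples against the boundary dates; the read_csv filename and the March/June/September/December 21 cut-offs below are assumptions to adjust as needed.

```python
import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['timestamp'])  # hypothetical filename

def season_of(ts):
    # Tuple comparison is lexicographic, so (month, day) bounds give exact dates.
    md = (ts.month, ts.day)
    if (3, 21) <= md < (6, 21):
        return 'spring'
    if (6, 21) <= md < (9, 21):
        return 'summer'
    if (9, 21) <= md < (12, 21):
        return 'autumn'
    return 'winter'

df['season'] = df['timestamp'].apply(season_of)
seasonal = df.groupby('season')['impact'].mean()  # summary to plot, e.g. seasonal.plot(kind='bar')
```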

Pandas Dataframe: how to add column with number of occurrences in other column

I have the following df:

```
Col1   Col2
test   Something
test2  Something
test3  Something
test   Something
test2  Something
test5  Something
```

I want to get:

```
Col1   Col2       Occur
test   Something  2
test2  Something  2
test3  Something  1
test   Something  2
test2  Something  2
test5  Something  1
```

I've tried to use:

```python
df["Occur"] = df["Col1"].value_counts()
```

but it didn't help; I got an Occur column full of NaN.

Answer: groupby on 'Col1' and then apply transform on Col2 to return a Series with its index aligned to the original df, so you can add it as a column:

```python
In [3]: df['Occur'] = df.groupby('Col1')['Col2'].transform(pd.Series.value…
```
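The truncated answer above is the transform idea; a complete, minimal version counts the rows of each Col1 group and broadcasts the count back:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['test', 'test2', 'test3', 'test', 'test2', 'test5'],
                   'Col2': ['Something'] * 6})

# transform('size') returns one count per row, aligned to df's index.
df['Occur'] = df.groupby('Col1')['Col1'].transform('size')
```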

Why don't first and last in a groupby give me first and last?

I'm posting this because the topic just got brought up in another question/answer and the behavior isn't very well documented. Consider the dataframe df:

```python
df = pd.DataFrame(dict(
    A=list('xxxyyy'),
    B=[np.nan, 1, 2, 3, 4, np.nan]
))
```

```
   A    B
0  x  NaN
1  x  1.0
2  x  2.0
3  y  3.0
4  y  4.0
5  y  NaN
```

I wanted to get the first and last rows of each group defined by column 'A'. I tried:

```python
df.groupby('A').B.agg(['first', 'last'])
```

```
   first  last
A
x    1.0   2.0
y    3.0   4.0
```

However, this doesn't give me the np.nan values that I expected. How do I get the actual first and last values in each group? One option is to use the .nth method:
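Besides .nth, one version-robust way to get the literal first and last values (NaN included) is to aggregate positionally with iloc; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=list('xxxyyy'),
                       B=[np.nan, 1, 2, 3, 4, np.nan]))

# iloc is purely positional, so edge NaNs are kept, unlike 'first'/'last'.
out = df.groupby('A')['B'].agg(first_val=lambda s: s.iloc[0],
                               last_val=lambda s: s.iloc[-1])
print(out)
#    first_val  last_val
# A
# x        NaN       2.0
# y        3.0       NaN
```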

Python Pandas Sum Values in Columns If date between 2 dates

Question: I have a dataframe df which can be created with this:

```python
import datetime

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2],
        'date1': [datetime.date(2016,1,1), datetime.date(2016,1,2), datetime.date(2016,1,3), datetime.date(2016,1,4),
                  datetime.date(2016,1,2), datetime.date(2016,1,4), datetime.date(2016,1,3), datetime.date(2016,1,1)],
        'date2': [datetime.date(2016,1,5), datetime.date(2016,1,3), datetime.date(2016,1,5), datetime.date(2016,1,5),
                  datetime.date(2016,1,4), datetime.date(2016,1,5), datetime.date(2016,1,4), datetime.date(2016,1,1)], …
```
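The excerpt stops before the value column and the exact rule, so the sketch below invents a value column and assumes the goal is, for each row, the per-id total of values whose date1 falls inside that row's [date1, date2] window; treat both as placeholders.

```python
import datetime
import pandas as pd

df = pd.DataFrame(data)                          # assumes the dict above is completed
df['value'] = [10, 20, 30, 40, 50, 60, 70, 80]   # hypothetical column to sum

def window_sum(row, frame):
    # Rows with the same id whose date1 lies inside this row's date window.
    same_id = frame[frame['id'] == row['id']]
    in_window = same_id['date1'].between(row['date1'], row['date2'])
    return same_id.loc[in_window, 'value'].sum()

df['window_sum'] = df.apply(window_sum, axis=1, frame=df)
```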