pandas-groupby

divide a column based on groupby or looping conditions in pandas

你说的曾经没有我的故事 submitted on 2020-04-15 06:48:29
Question: I have a data frame as shown below

    B_ID  No_Show  Session  slot_num  Patient_count
    1     0.2      S1       1         1
    2     0.3      S1       2         1
    3     0.8      S1       3         1
    4     0.3      S1       3         2
    5     0.6      S1       4         1
    6     0.8      S1       5         1
    7     0.9      S1       5         2
    8     0.4      S1       5         3
    9     0.6      S1       5         4
    12    0.9      S2       1         1
    13    0.5      S2       1         2
    14    0.3      S2       2         1
    15    0.7      S2       3         1
    20    0.7      S2       4         1
    16    0.6      S2       5         1
    17    0.8      S2       5         2
    19    0.3      S2       5         3

where No_Show = probability of a no-show. Assume that threshold probability = 0.2 and duration for each slot = 30 (minutes). From the above I would like to calculate the data frame below. Step1 sort the

Include missing group keys as NaN in pandas GroupBy output

杀马特。学长 韩版系。学妹 submitted on 2020-04-14 15:50:51
Question: I have a dataframe in pandas.

    test_df = pd.DataFrame({'date': ['2018-12-28', '2018-12-28', '2018-12-29', '2018-12-29', '2018-12-30', '2018-12-30'],
                            'transaction': ['aa', 'bb', 'cc', 'aa', 'bb', 'bb'],
                            'ccy': ['USD', 'EUR', 'EUR', 'USD', 'USD', 'USD'],
                            'amt': np.random.random(6)})

test_df:

    date        transaction  ccy  amt
    2018-12-28  aa           USD  0.323439
    2018-12-28  bb           EUR  0.048948
    2018-12-29  cc           EUR  0.793263
    2018-12-29  aa           USD  0.013865
    2018-12-30  bb           USD  0.658571
    2018-12-30  bb           USD  0.224951

The following code
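The excerpt is cut off before the asker's code, so the exact grouping is unknown. As a minimal sketch, assuming the goal is to sum `amt` per `date`/`transaction`/`ccy` and to show every combination of those keys (with NaN where no rows exist), one common approach is to reindex the grouped result against the full Cartesian product of the observed key values. The fixed `amt` values are copied from the printed frame so the result is reproducible.

```python
import pandas as pd

# Values copied from the printed test_df (the original used np.random.random).
test_df = pd.DataFrame({
    "date": ["2018-12-28", "2018-12-28", "2018-12-29",
             "2018-12-29", "2018-12-30", "2018-12-30"],
    "transaction": ["aa", "bb", "cc", "aa", "bb", "bb"],
    "ccy": ["USD", "EUR", "EUR", "USD", "USD", "USD"],
    "amt": [0.323439, 0.048948, 0.793263, 0.013865, 0.658571, 0.224951],
})

# Assumption: group on all three key columns and sum the amounts.
summed = test_df.groupby(["date", "transaction", "ccy"])["amt"].sum()

# Build every combination of the observed key values...
full_index = pd.MultiIndex.from_product(summed.index.levels,
                                        names=summed.index.names)

# ...and reindex, so combinations with no rows appear as NaN.
result = summed.reindex(full_index)
print(result)
```

If zeros are preferred over NaN, `reindex` also accepts `fill_value=0`.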

Why do np.std() and pivot_table(aggfunc=np.std) return different results

江枫思渺然 submitted on 2020-04-13 08:27:56
Question: I have some code and do not understand why the difference occurs. np.std() defaults to ddof=0 when it is used alone, but when it is passed as an argument in pivot_table(aggfunc=np.std), it switches to ddof=1 automatically. Why?

    import numpy as np
    import pandas as pd

    dft = pd.DataFrame({'A': ['one', 'one'],
                        'B': ['A', 'A'],
                        'C': ['bar', 'bar'],
                        'D': [-0.866740402, 1.490732028]})

    np.std(dft['D'])
    # equivalent: np.std([-0.866740402, 1.490732028]) (default ddof=0)
    # the result: 1.178736215

    dft
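The two conventions can be compared directly. The sketch below uses the two values from the question and shows that NumPy's default (`ddof=0`, population standard deviation) and pandas' default for `Series.std()` (`ddof=1`, sample standard deviation) differ by a factor of sqrt(2) when there are two observations. The likely explanation for the asker's result is that pandas substituted its own `std` when given `np.std` as an aggregator, but that behaviour depends on the pandas version, so treat it as an assumption. Passing a lambda with an explicit `ddof` avoids the ambiguity.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({"A": ["one", "one"],
                    "B": ["A", "A"],
                    "C": ["bar", "bar"],
                    "D": [-0.866740402, 1.490732028]})

values = dft["D"].to_numpy()

population_std = np.std(values)        # NumPy default: ddof=0
sample_std = np.std(values, ddof=1)    # explicit sample standard deviation
pandas_std = dft["D"].std()            # pandas default: ddof=1

print(population_std, sample_std, pandas_std)

# Being explicit about ddof inside the aggregation function removes any
# dependence on how pandas treats a bare np.std.
table = dft.pivot_table(index="A", values="D",
                        aggfunc=lambda s: np.std(s.to_numpy(), ddof=0))
print(table)
```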

Dataframe cell to be locked and used for a running balance calculation (follow up question)

天涯浪子 submitted on 2020-04-11 17:58:20
Question: (This is a follow-up question to my previous question, which was answered correctly.) Say I have the following dataframe

    import pandas as pd

    df = pd.DataFrame()
    df['E'] = ('SIT', 'SCLOSE', 'SHODL', 'SHODL', 'SHODL', 'SHODL', 'SHODL', 'SHODL', 'SHODL', 'SCLOSE_BUY',
               'BCLOSE_SELL', 'BHODL', 'BHODL', 'BHODL', 'BHODL', 'BHODL', 'BHODL', 'BUY', 'SIT', 'SIT')
    df['F'] = (0.00, 1.00, 10.00, 5.00, 6.00, -6.00, 6.00, 2.00, 10.00, 10.00,
               -8.00, 33.00, -15.00, 6.00, -1.00, 5.00, 10.00, 0.00, 0.00, 0.00)
    df.loc[19, 'G'] = 100

python: use agg with more than one customized function

元气小坏坏 submitted on 2020-03-28 06:39:13
Question: I have a data frame like this.

    mydf = pd.DataFrame({'a': [1, 1, 3, 3], 'b': [np.nan, 2, 3, 6], 'c': [1, 3, 3, 9]})

       a    b  c
    0  1  NaN  1
    1  1  2.0  3
    2  3  3.0  3
    3  3  6.0  9

I would like to have a resulting dataframe like this.

    myResults = pd.concat([mydf.groupby('a').apply(lambda x: (x.b/x.c).max()),
                           mydf.groupby('a').apply(lambda x: (x.b/x.c).min())], axis=1)
    myResults.columns = ['max', 'min']

            max       min
    a
    1  0.666667  0.666667
    3  1.000000  0.666667

Basically I would like to have max and min of ratio of column b and column
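One way to get both statistics in a single pass, sketched here as an alternative to the two `apply` calls rather than as the accepted answer, is to compute the ratio as a column first and then hand a list of aggregations to `agg`. The column name `ratio` is introduced only for this illustration.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({"a": [1, 1, 3, 3],
                     "b": [np.nan, 2, 3, 6],
                     "c": [1, 3, 3, 9]})

# Compute b/c once, then aggregate it per group with several functions.
# "ratio" is a hypothetical helper column, not part of the original frame.
my_results = (mydf.assign(ratio=mydf["b"] / mydf["c"])
                  .groupby("a")["ratio"]
                  .agg(["max", "min"]))
print(my_results)
```

Both `max` and `min` skip NaN by default, which is why group 1 reports the single valid ratio for both columns.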

Time difference in days based on specific condition in pandas

╄→尐↘猪︶ㄣ submitted on 2020-03-25 19:16:27
Question: I have a data frame as shown below

    ID  CONSTRUCTION_DATE  START_DATE  END_DATE    CANCELLED_DATE
    1   2016-02-06         2016-02-26  2017-02-26  NaT
    1   2016-02-06         2017-03-27  2018-02-26  2017-05-22
    1   2016-02-06         2017-08-27  2019-02-26  2017-10-21
    1   2016-02-06         2018-07-27  2021-02-26  NaT
    2   2016-05-06         2017-03-27  2018-02-26  NaT
    2   2016-05-06         2018-08-27  2019-02-26  NaT

The above data has to be ordered by ID and START_DATE. From the above data frame I would like to prepare the dataframe below: ID D_from_C_to_first_S_D T_D_V_aft_c

Python 3 pandas.groupby.filter

懵懂的女人 submitted on 2020-03-18 10:52:50
Question: I am trying to perform a groupby filter that is very similar to the example in this documentation: pandas groupby filter

    >>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
    ...                           'foo', 'bar'],
    ...                    'B' : [1, 2, 3, 4, 5, 6],
    ...                    'C' : [2.0, 5., 8., 1., 2., 9.]})
    >>> grouped = df.groupby('A')
    >>> grouped.filter(lambda x: x['B'].mean() > 3.)
         A  B    C
    1  bar  2  5.0
    3  bar  4  1.0
    5  bar  6  9.0

I am trying to return a DataFrame that has all 3 columns, but only 2 rows. Those 2 rows contain the minimum

Fill in missing dates of groupby

爷,独闯天下 submitted on 2020-03-18 04:46:07
Question: Imagine I have a dataframe that looks like:

    ID  DATE        VALUE
    1   31-01-2006  5
    1   28-02-2006  5
    1   31-05-2006  10
    1   30-06-2006  11
    2   31-01-2006  5
    2   31-02-2006  5
    2   31-03-2006  5
    2   31-04-2006  5

As you can see, this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in missing dates for each ID. You can see that for ID "1" there is a jump in months between the second and third entry. I would like a dataframe that looks like:

    ID  DATE        VALUE
    1   31-01-2006  5
    1   28-02
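The desired output is truncated, so it is not clear whether the gaps should hold NaN or a carried-forward value. The following is only a sketch under stated assumptions: dates are treated as monthly periods, each ID is reindexed over the full range between its own first and last month, and gaps are left as NaN. The rows for ID 1 come from the question. The rows for ID 2 are made up for the illustration, because the question's `31-02-2006` and `31-04-2006` are not valid calendar dates. `MONTH` is a helper column introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    # ID 1 dates are from the question; ID 2 dates are illustrative.
    "DATE": ["31-01-2006", "28-02-2006", "31-05-2006", "30-06-2006",
             "31-01-2006", "31-03-2006"],
    "VALUE": [5, 5, 10, 11, 5, 5],
})

# Work with monthly periods so that month-end days do not matter.
df["MONTH"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y").dt.to_period("M")

frames = []
for key, group in df.groupby("ID"):
    # Every month between this ID's first and last observation.
    full = pd.period_range(group["MONTH"].min(), group["MONTH"].max(), freq="M")
    out = group.set_index("MONTH")[["VALUE"]].reindex(full)
    out.index.name = "MONTH"
    out = out.reset_index()
    out.insert(0, "ID", key)
    frames.append(out)

filled = pd.concat(frames, ignore_index=True)
print(filled)
```

Calling `filled.groupby("ID")["VALUE"].ffill()` afterwards would carry the last known value into the gaps, if that is what the truncated target table intends.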