How to get minimum of each group for each day based on hour criteria

北海茫月 2020-12-22 01:29

I have provided two dataframes below for you to test.

df = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,1,1,1,1],
    'time_1' :['2173-04-03 12:35:00','


        
4 Answers
  •  暖寄归人
    2020-12-22 02:34

    I came up with the approach below, and it works. Any suggestions are welcome.

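    import pandas as pd

    # Assumption: time_1 may still hold strings, so convert it first (a no-op if it is already datetime)
    df['time_1'] = pd.to_datetime(df['time_1'])

    # time left until midnight, used as the duration of the last reading of each day
    s = pd.to_timedelta(24, unit='h') - (df.time_1 - df.time_1.dt.normalize())
    # duration of each reading = time until the next reading on the same day
    df['tdiff'] = df.groupby(df.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
    df['t_d'] = df['tdiff'].dt.total_seconds()/3600  # duration in hours
    df['hr'] = df['time_1'].dt.hour
    df['date'] = df['time_1'].dt.date
    df['day'] = pd.DatetimeIndex(df['time_1']).day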
    
    # Total duration and count of each val per day and hour. Since sort=False, the time order is retained
    
    temp_1 = df.groupby(['subject_id','date','hr','val'], sort=False)['t_d'].agg(cumduration='sum', freq='count').reset_index()
    
    # Drop the hour component and sum each value's duration across hours of the same day (for example, `5` occurred in the 12th and 13th hours, so both are summed)
    
    temp_2 = temp_1.groupby(['subject_id','date','val'], sort=False)['cumduration'].agg(sum_of_cumduration='sum', freq='count').reset_index()
    
    # Mask for the `> 1` hr criterion
    
    mask = temp_2['sum_of_cumduration'] > 1
    output_1 = temp_2[mask].groupby(['subject_id','date'])['val'].min().reset_index()
    
    # Minimum among the `<= 1` hr records
    
    output_2 = temp_2[~mask].groupby(['subject_id','date'])['val'].min().reset_index()
    
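    # Finally, keep output_2 rows only for (subject_id, date) pairs missing from output_1, then append
    keys = ['subject_id', 'date']
    missing = ~output_2.set_index(keys).index.isin(output_1.set_index(keys).index)
    output = pd.concat([output_1, output_2[missing]])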
    
    output
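
For comparison, here is a more compact sketch of the same idea. It is not the code above, only one possible way to write it. The `toy` frame, the `daily_min` function and its `min_hours` parameter are hypothetical names introduced for illustration, and the data is made up because the asker's frame is truncated. It assumes that each reading lasts until the next reading of the same subject on the same day, and that the last reading of a day lasts until midnight.

```python
import pandas as pd

# Hypothetical toy data (not the asker's frame): one subject, two days.
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 1],
    "time_1": pd.to_datetime([
        "2173-04-03 12:35:00",
        "2173-04-03 12:50:00",
        "2173-04-03 22:00:00",
        "2173-04-04 23:30:00",
        "2173-04-04 23:45:00",
    ]),
    "val": [5, 6, 7, 9, 8],
})


def daily_min(df, min_hours=1.0):
    """Per subject and day: smallest val held for more than min_hours,
    falling back to the smallest val of the day when none qualifies."""
    df = df.sort_values(["subject_id", "time_1"]).copy()
    df["date"] = df["time_1"].dt.date

    # Time of the next reading for the same subject on the same day
    # (NaT for the last reading of each day).
    nxt = df.groupby(["subject_id", "date"])["time_1"].shift(-1)
    # The last reading of a day is assumed to last until midnight.
    midnight = df["time_1"].dt.normalize() + pd.Timedelta(days=1)
    df["hours"] = (nxt.fillna(midnight) - df["time_1"]).dt.total_seconds() / 3600

    # Total hours and number of readings per value per day (named aggregation).
    totals = (
        df.groupby(["subject_id", "date", "val"], sort=False)
          .agg(hours=("hours", "sum"), freq=("hours", "count"))
          .reset_index()
    )

    keys = ["subject_id", "date"]
    long_enough = totals[totals["hours"] > min_hours].groupby(keys)["val"].min()
    fallback = totals.groupby(keys)["val"].min()

    # Prefer the qualifying minimum; use the fallback where nothing qualified.
    result = long_enough.reindex(fallback.index).fillna(fallback)
    return result.reset_index()


result = daily_min(toy)
```

On this toy data, 2173-04-03 has the values 6 and 7 held for more than one hour, so the result for that day is 6. On 2173-04-04 no value lasts longer than an hour, so the result falls back to the smallest value of the day, 8. Because the fallback is looked up per `(subject_id, date)` pair, no separate append step is needed.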
    
