Find group of consecutive dates in Pandas DataFrame

抹茶落季 2020-12-18 05:20

I am trying to get the chunks of data where there are consecutive dates in a Pandas DataFrame. My df looks like this.

      DateAnalyzed             


        
2 Answers
  •  小蘑菇 (original poster)
    2020-12-18 06:04

    Similar questions with more specific output requirements were asked after this one, here and here. Since this one is more general, I would like to contribute here as well.

    We can easily assign a unique identifier to each group of consecutive dates with one line of code:

    df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()
    

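    Here, diff() gives the gap between each date and the previous one, ne(1) flags every row whose gap is not exactly one day (including the first row, which has no previous date), and cumsum() increments a counter at each flagged row. Rows in the same run of consecutive days keep the previous counter value, so each run ends up with its own identifier. This assumes DateAnalyzed has a datetime dtype and is sorted in ascending order.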

    See the output:

      DateAnalyzed       Val  grp_date
    1   2018-03-18  0.470253         1
    2   2018-03-19  0.470253         1
    3   2018-03-20  0.470253         1
    4   2018-09-25  0.467729         2
    5   2018-09-26  0.467729         2
    6   2018-09-27  0.467729         2
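
    To make the one-liner easier to follow, here is a minimal sketch that splits it into its intermediate steps. The sample DataFrame is an assumption reconstructed from the output table above (with a default 0-based index), and the names gap_days and starts_new_run are illustrative only.

```python
import pandas as pd

# Hypothetical sample data, reconstructed from the output table above
df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})

# The technique assumes datetime values in ascending order
df = df.sort_values("DateAnalyzed").reset_index(drop=True)

# Step 1: gap in days to the previous row (NaN for the first row)
gap_days = df["DateAnalyzed"].diff().dt.days

# Step 2: True wherever the gap is not exactly one day, i.e. a new run starts
starts_new_run = gap_days.ne(1)

# Step 3: running count of run starts gives one identifier per run
df["grp_date"] = starts_new_run.cumsum()

print(df)
```

    Because ne(1) is used rather than gt(1), duplicate dates (a gap of 0 days) also start a new group, so it may be worth dropping duplicates first if that is not what you want.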
    

    Now it's easy to group by "grp_date" and run whatever operation you need with apply or agg.


    Examples:

    # Sum numeric columns across consecutive days (or use any other pandas groupby method)
    # numeric_only=True skips the datetime column, which cannot be summed
    df.groupby('grp_date').sum(numeric_only=True)
    
    # Get the first and last rows of each run of consecutive days
    df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
    # or df.groupby('grp_date').head(n) for the first n days
    
    # Perform a custom operation across target columns
    # ('col1' and 'col2' are placeholder column names)
    df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())
    
    # Multiple operations on one target column
    df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])
    
    # and so on...
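
    As one more possible use of the group identifier, the sketch below summarizes each run of consecutive dates with named aggregation. The sample data and the output column names (start, end, n_days, mean_val) are assumptions made for illustration, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data with two runs of consecutive dates
df = pd.DataFrame({
    "DateAnalyzed": pd.to_datetime([
        "2018-03-18", "2018-03-19", "2018-03-20",
        "2018-09-25", "2018-09-26", "2018-09-27",
    ]),
    "Val": [0.470253] * 3 + [0.467729] * 3,
})

# Same grouping one-liner as in the answer
df["grp_date"] = df["DateAnalyzed"].diff().dt.days.ne(1).cumsum()

# One row per run: first date, last date, number of rows, mean of Val
runs = df.groupby("grp_date").agg(
    start=("DateAnalyzed", "min"),
    end=("DateAnalyzed", "max"),
    n_days=("DateAnalyzed", "count"),
    mean_val=("Val", "mean"),
)

print(runs)
```

    To pull out the rows of a single run instead, filtering with df[df["grp_date"] == 1] or calling df.groupby("grp_date").get_group(1) should work equally well.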
    
