Find group of consecutive dates in Pandas DataFrame

抹茶落季 2020-12-18 05:20

I am trying to extract the chunks of data that contain consecutive dates from a Pandas DataFrame. My df looks like this:

      DateAnalyzed             


        
2 answers
  •  一生所求
    2020-12-18 05:41

    It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.

    There's also one tricky part that can be fleshed out by example. Notice that df below contains an added row that doesn't have any consecutive dates before or after it.

    >>> df
      DateAnalyzed       Val
    1   2018-03-18  0.470253
    2   2018-03-19  0.470253
    3   2018-03-20  0.470253
    4   2017-01-20  0.485949  # < watch out for this
    5   2018-09-25  0.467729
    6   2018-09-26  0.467729
    7   2018-09-27  0.467729
    
    >>> df.dtypes
    DateAnalyzed    datetime64[ns]
    Val                    float64
    dtype: object
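
To reproduce the frame above, a construction along these lines should work. This is a minimal sketch: the values are copied from the printed output, and the 1-based index is set explicitly to match it.

```python
import pandas as pd

# Values copied from the printed frame; row 4 is the isolated date.
df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            [
                "2018-03-18", "2018-03-19", "2018-03-20",
                "2017-01-20",
                "2018-09-25", "2018-09-26", "2018-09-27",
            ]
        ),
        "Val": [0.470253, 0.470253, 0.470253, 0.485949,
                0.467729, 0.467729, 0.467729],
    },
    index=range(1, 8),  # match the 1..7 index shown above
)
```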
    

    The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See end of answer for a solution if you do want to process this date.)

    First:

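    >>> import pandas as pd
    >>> dt = df['DateAnalyzed']
    >>> day = pd.Timedelta('1d')
    >>> # True if the next row is one day away, or the previous row is exactly one day earlier
    >>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)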
    >>> in_block
    1     True
    2     True
    3     True
    4    False
    5     True
    6     True
    7     True
    Name: DateAnalyzed, dtype: bool
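
It may help to see the two halves of that mask separately. The sketch below rebuilds the date column on its own and splits the expression into two intermediate Series; the names `follows_prev` and `precedes_next` are illustrative and not part of the original answer.

```python
import pandas as pd

dt = pd.Series(
    pd.to_datetime(
        [
            "2018-03-18", "2018-03-19", "2018-03-20",
            "2017-01-20",
            "2018-09-25", "2018-09-26", "2018-09-27",
        ]
    ),
    index=range(1, 8),
    name="DateAnalyzed",
)
day = pd.Timedelta("1d")

# Row is exactly one day after the row above it.
follows_prev = dt.diff() == day
# Row is one day away from the row below it.
precedes_next = (dt - dt.shift(-1)).abs() == day

# A row belongs to a block if either neighbour is adjacent in time.
in_block = precedes_next | follows_prev
```

The first row of each block is caught only by `precedes_next`, and the last row only by `follows_prev`, which is why both comparisons are needed.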
    

    Now, in_block tells you which dates are in a "consecutive" block, but it does not tell you which group each date belongs to.

    The next step is to derive the groupings themselves:

    >>> filt = df.loc[in_block]
    >>> breaks = filt['DateAnalyzed'].diff() != day
    >>> groups = breaks.cumsum()
    >>> groups
    1    1
    2    1
    3    1
    5    2
    6    2
    7    2
    Name: DateAnalyzed, dtype: int64
    

    Then you can call filt.groupby(groups) with your operation of choice:

    >>> for _, frame in filt.groupby(groups):
    ...     print(frame, end='\n\n')
    ... 
      DateAnalyzed       Val
    1   2018-03-18  0.470253
    2   2018-03-19  0.470253
    3   2018-03-20  0.470253
    
      DateAnalyzed       Val
    5   2018-09-25  0.467729
    6   2018-09-26  0.467729
    7   2018-09-27  0.467729
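
As one possible "operation of choice", the blocks can be summarised by their start date, end date, and length. This is a sketch under the same assumptions as the steps above (dates ascending within each block); the column names `start`, `end`, and `n_days` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            [
                "2018-03-18", "2018-03-19", "2018-03-20",
                "2017-01-20",
                "2018-09-25", "2018-09-26", "2018-09-27",
            ]
        ),
        "Val": [0.470253, 0.470253, 0.470253, 0.485949,
                0.467729, 0.467729, 0.467729],
    },
    index=range(1, 8),
)

dt = df["DateAnalyzed"]
day = pd.Timedelta("1d")
in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)

# Keep only rows that sit in a consecutive block, then label the blocks.
filt = df.loc[in_block]
groups = (filt["DateAnalyzed"].diff() != day).cumsum()

# One row per block: first date, last date, number of rows.
summary = filt.groupby(groups)["DateAnalyzed"].agg(
    start="min", end="max", n_days="count"
)
print(summary)
```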
    

    To incorporate this back into df, assign groups as a new column. The assignment aligns on the index, so the isolated dates get NaN (which is also why the column becomes float):

    >>> df['groups'] = groups
    >>> df
      DateAnalyzed       Val  groups
    1   2018-03-18  0.470253     1.0
    2   2018-03-19  0.470253     1.0
    3   2018-03-20  0.470253     1.0
    4   2017-01-20  0.485949     NaN
    5   2018-09-25  0.467729     2.0
    6   2018-09-26  0.467729     2.0
    7   2018-09-27  0.467729     2.0
    

    If you do want to include the "lone" date as a group of its own, things become a bit more straightforward, because no in_block mask is needed:

    dt = df['DateAnalyzed']
    day = pd.Timedelta('1d')
    breaks = dt.diff() != day
    groups = breaks.cumsum()
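
Run against the example dates, this variant should give the lone date a group of its own. The sketch below is self-contained and assumes, as the answer does, that dates are ascending within each block; the `sizes` variable is only there to show the result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DateAnalyzed": pd.to_datetime(
            [
                "2018-03-18", "2018-03-19", "2018-03-20",
                "2017-01-20",
                "2018-09-25", "2018-09-26", "2018-09-27",
            ]
        )
    },
    index=range(1, 8),
)

dt = df["DateAnalyzed"]
day = pd.Timedelta("1d")

# Every row that is not exactly one day after its predecessor starts a new group.
breaks = dt.diff() != day
groups = breaks.cumsum()

# Number of rows per group; the lone date ends up as a group of size 1.
sizes = df.groupby(groups).size()
print(groups.tolist())
print(sizes.tolist())
```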
    
