Count Number of Rows Between Two Dates BY ID in a Pandas GroupBy Dataframe

Back-end · 3 answers · 1824 views
Asked by 一个人的身影 on 2020-12-10 08:46

I have the following test DataFrame:

import random
from datetime import timedelta
import pandas as pd
import datetime

#create test range of dates
rng=pd.dat
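The snippet is cut off at this point in the source. Based on the column names the answers below rely on (cid, jid, stdt, enddt), a minimal stand-in test frame might look like the following; the values are invented for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the truncated test data: each row is a job (jid)
# for a client (cid), active from stdt through enddt inclusive.
df = pd.DataFrame({
    "cid":   [1, 1, 1, 2, 2],
    "jid":   [100, 101, 102, 200, 201],
    "stdt":  pd.to_datetime(["2015-07-06", "2015-07-11", "2015-07-12",
                             "2015-07-01", "2015-07-07"]),
    "enddt": pd.to_datetime(["2015-07-20", "2015-07-19", "2015-07-18",
                             "2015-07-21", "2015-07-16"]),
})
```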
3 Answers
  • 2020-12-10 09:01

    Here is a solution I came up with (it loops over the Cartesian product of the unique cids and the full date range to get your counts):

    from itertools import product

    # Build every (cid, date) pair spanning the full stdt..enddt range
    df_new_date = pd.DataFrame(
        list(product(df.cid.unique(),
                     pd.date_range(df.stdt.min(), df.enddt.max()))),
        columns=['cid', 'newdate'])

    # For each pair, count the jids whose [stdt, enddt] interval covers the date
    df_new_date['cnt'] = df_new_date.apply(
        lambda row: df[(df['cid'] == row['cid']) &
                       (df['stdt'] <= row['newdate']) &
                       (df['enddt'] >= row['newdate'])]['jid'].count(),
        axis=1)
    
    >>> df_new_date.head(20) 
        cid    newdate  cnt
    0     1 2015-07-01    0
    1     1 2015-07-02    0
    2     1 2015-07-03    0
    3     1 2015-07-04    0
    4     1 2015-07-05    0
    5     1 2015-07-06    1
    6     1 2015-07-07    1
    7     1 2015-07-08    1
    8     1 2015-07-09    1
    9     1 2015-07-10    1
    10    1 2015-07-11    2
    11    1 2015-07-12    3
    12    1 2015-07-13    3
    13    1 2015-07-14    2
    14    1 2015-07-15    3
    15    1 2015-07-16    3
    16    1 2015-07-17    3
    17    1 2015-07-18    3
    18    1 2015-07-19    2
    19    1 2015-07-20    1
    

    You could then drop the zeros if you don't want them. I don't think this will be much better than your original solution, however.
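    As a side note (not from the original answer): on pandas 1.2 and later, the same cross product can be built with merge(how="cross"), and the per-row apply replaced by a boolean mask plus a groupby sum, which is usually much faster. A sketch on invented data with the assumed column names:

```python
import pandas as pd

# Toy frame with the assumed columns (hypothetical data for illustration)
df = pd.DataFrame({
    "cid":   [1, 1, 2],
    "jid":   [100, 101, 200],
    "stdt":  pd.to_datetime(["2015-07-06", "2015-07-11", "2015-07-01"]),
    "enddt": pd.to_datetime(["2015-07-10", "2015-07-12", "2015-07-05"]),
})

# Cross-join every cid with every date in the overall range (pandas >= 1.2)
dates = pd.DataFrame({"newdate": pd.date_range(df["stdt"].min(), df["enddt"].max())})
grid = df[["cid"]].drop_duplicates().merge(dates, how="cross")

# Flag, per (cid, newdate) pair, each job whose [stdt, enddt] covers the date,
# then sum the flags to get the count
merged = grid.merge(df, on="cid")
merged["hit"] = (merged["stdt"] <= merged["newdate"]) & (merged["enddt"] >= merged["newdate"])
df_new_date = (merged.groupby(["cid", "newdate"], as_index=False)["hit"]
               .sum().rename(columns={"hit": "cnt"}))
```

    This keeps the whole computation vectorized instead of calling a Python lambda once per (cid, date) pair.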

    I would also suggest the following improvement on the loop from @DSM's solution:

    df_parts = []
    for cid in df.cid.unique():
        # Forward-fill each cid's daily counts across its stdt..enddt span
        full_count = (df[df.cid == cid][['cid', 'date', 'count']]
                      .set_index("date")
                      .asfreq("D", method='ffill')[['cid', 'count']]
                      .reset_index())
        df_parts.append(full_count[full_count['count'] != 0])

    df_new = pd.concat(df_parts)
    
    >>> df_new
             date  cid  count
    0  2015-07-06    1      1
    1  2015-07-07    1      1
    2  2015-07-08    1      1
    3  2015-07-09    1      1
    4  2015-07-10    1      1
    5  2015-07-11    1      2
    6  2015-07-12    1      3
    7  2015-07-13    1      3
    8  2015-07-14    1      2
    9  2015-07-15    1      3
    10 2015-07-16    1      3
    11 2015-07-17    1      3
    12 2015-07-18    1      3
    13 2015-07-19    1      2
    14 2015-07-20    1      1
    15 2015-07-21    1      1
    16 2015-07-22    1      1
    0  2015-07-01    2      1
    1  2015-07-02    2      1
    2  2015-07-03    2      1
    3  2015-07-04    2      1
    4  2015-07-05    2      1
    5  2015-07-06    2      1
    6  2015-07-07    2      2
    7  2015-07-08    2      2
    8  2015-07-09    2      2
    9  2015-07-10    2      3
    10 2015-07-11    2      3
    11 2015-07-12    2      4
    12 2015-07-13    2      4
    13 2015-07-14    2      5
    14 2015-07-15    2      4
    15 2015-07-16    2      4
    16 2015-07-17    2      3
    17 2015-07-18    2      2
    18 2015-07-19    2      2
    19 2015-07-20    2      1
    20 2015-07-21    2      1
    

    The only real improvement over what @DSM provided is that this avoids creating a groupby object inside the loop, and it also covers each cid from its min stdt to its max enddt with the zero-count rows dropped.

  • 2020-12-10 09:08

    My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every new "stdt" we see adds +1 to the count; every "enddt" we see adds -1 (on the following day, at least if I'm interpreting "between" the way you are; some days I think the word should be banned as too ambiguous).

    IOW, if we turn your frame to something like

    >>> df.head()
        cid  jid  change       date
    0     1  100       1 2015-01-06
    1     1  101       1 2015-01-07
    21    1  100      -1 2015-01-16
    22    1  101      -1 2015-01-17
    17    1  117       1 2015-03-01
    

    then what we want is simply the cumulative sum of change (after suitable regrouping). For example, something like

    df["enddt"] += timedelta(days=1)
    df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
    df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
    df = df.sort_values(["cid", "date"])  # older pandas used df.sort(...)
    
    df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
    df["count"] = df.groupby("cid")["change"].cumsum()
    
    new_time = pd.date_range(df.date.min(), df.date.max())
    
    df_parts = []
    for cid, group in df.groupby("cid"):
        full_count = group[["date", "count"]].set_index("date")
        full_count = full_count.reindex(new_time)
        full_count = full_count.ffill().fillna(0)
        full_count["cid"] = cid
        df_parts.append(full_count)
    
    df_new = pd.concat(df_parts)
    

    which gives me something like

    >>> df_new.head(15)
                count  cid
    2015-01-03      0    1
    2015-01-04      0    1
    2015-01-05      0    1
    2015-01-06      1    1
    2015-01-07      2    1
    2015-01-08      2    1
    2015-01-09      2    1
    2015-01-10      2    1
    2015-01-11      2    1
    2015-01-12      2    1
    2015-01-13      2    1
    2015-01-14      2    1
    2015-01-15      2    1
    2015-01-16      1    1
    2015-01-17      0    1
    

    There may be off-by-one differences relative to your expectations, and you may have different ideas about how to handle multiple overlapping jids in the same time window (here they count as 2), but the basic idea of working with the events should prove useful even if you have to tweak the details.

  • 2020-12-10 09:09


    I have a DataFrame df containing the start and end time of each event, for example:

    start     end
    08:08:20  08:09:20
    08:08:11  08:13:99
    08:09:15  08:10:50
    08:11:10  08:12:00
    08:11:10  08:13:00
    
    

    I want the number of simultaneous events in each minute. I generate a DataFrame df1 containing every minute between the min start and the max end, and for each minute I count the events where df.end > df1.Time and df.start < df1.Time.

    My code is:

    df1["nb_events"] = 0

    for i in range(0, df1.shape[0]):
        for j in range(0, df.shape[0]):
            if df.end[j] > df1.Time[i] and df.start[j] < df1.Time[i]:
                df1.loc[i, "nb_events"] += 1  # .loc avoids chained assignment
    

    desired results df1:

    Time              nb_event
    .
    .
    .
    08:08:00            2
    08:09:00            2
    08:10:00            1
    08:11:00            2
    08:12:00            3
    08:13:00            1
    .
    .
    .
    

    My code works and returns the desired results, but I have a large amount of data to process and it takes a long time to run. Can you suggest another way to do it? Thank you.
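    A possible answer in the spirit of the event-based approach above (a sketch, not the asker's code): turn every start into a +1 event and every end into a -1 event, take a cumulative sum, and sample the running count at each minute with merge_asof. The data below is made up to match the question's shape, and the boundary handling at exact minute marks may differ slightly from the strict comparisons in the loop:

```python
import pandas as pd

# Made-up events shaped like the question's data (time-only strings parse
# to today's date, which is fine since everything falls on the same day)
df = pd.DataFrame({
    "start": pd.to_datetime(["08:08:20", "08:08:11", "08:09:15",
                             "08:11:10", "08:11:10"]),
    "end":   pd.to_datetime(["08:09:20", "08:13:59", "08:10:50",
                             "08:12:00", "08:13:00"]),
})

# +1 at each start, -1 at each end; the running sum is the number of
# events open at any instant -- no nested Python loops needed.
events = pd.concat([
    pd.DataFrame({"time": df["start"], "change": 1}),
    pd.DataFrame({"time": df["end"], "change": -1}),
]).sort_values("time", kind="mergesort")
events["open"] = events["change"].cumsum()

# Sample the running count at every whole minute in the span
minutes = pd.DataFrame({"Time": pd.date_range(df["start"].min().floor("min"),
                                              df["end"].max().ceil("min"),
                                              freq="min")})
df1 = pd.merge_asof(minutes,
                    events.drop_duplicates("time", keep="last")[["time", "open"]],
                    left_on="Time", right_on="time")
df1["nb_events"] = df1["open"].fillna(0).astype(int)
```

    This replaces the O(len(df1) * len(df)) double loop with a sort plus two vectorized passes, which should scale far better on large data.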
