Extracting the first day of month of a datetime type column in pandas

你的背包 2020-12-14 06:11

I have the following dataframe:

user_id    purchase_date 
  1        2015-01-23 14:05:21
  2        2015-02-05 05:07:30
  3        2015-02-18 17:08:51
  4            


        
8 Answers
  • 2020-12-14 06:50

    To extract the first day of every month, you could write a small helper function that also works if the provided date is already the first of the month. The function looks like this:

    import pandas as pd

    def first_of_month(date):
        # Roll back to the previous month end, then step one day forward.
        return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
    

    You can apply this function to a pd.Series:

    df['month'] = df['purchase_date'].apply(first_of_month)
    

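    This gives you the month column as Timestamps. Note that the helper keeps the original time of day, so 2015-01-23 14:05:21 becomes 2015-01-01 14:05:21. If you need a specific format (or only the date part), you can convert it with the strftime() method.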

    df['month_str'] = df['month'].dt.strftime('%Y-%m-%d')
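
To make the edge cases concrete, here is a minimal sketch that runs the helper on a date in mid-month, a date already on the first, and a date on the last day of a month. The sample `dates` series is made up for illustration; the call to `.dt.normalize()` is an optional extra step (not part of the answer above) for when you want midnight Timestamps rather than strings.

```python
import pandas as pd

def first_of_month(date):
    # Previous month end plus one day = first day of the current month.
    return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)

# Hypothetical sample data covering mid-month, first-of-month and month-end.
dates = pd.Series(pd.to_datetime([
    '2015-01-23 14:05:21',
    '2015-02-01 00:00:00',
    '2015-02-28 17:08:51',
]))

month = dates.apply(first_of_month)    # time of day is preserved
month_midnight = month.dt.normalize()  # optional: reset time to 00:00:00

print(month)
print(month_midnight)
```

All three rows should land on the first of their own month; only the time of day differs between `month` and `month_midnight`.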
    
  • 2020-12-14 06:56

    The simplest and fastest approach is to convert the column to a NumPy array with .values and then cast it:

    df['month'] = df['purchase_date'].values.astype('datetime64[M]')
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

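    Another solution uses floor and pd.offsets.MonthBegin(1). Note that subtracting MonthBegin(1) from a date that already falls on the first of a month rolls it back to the first of the previous month, so this only gives the results shown when no date falls on the 1st: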

    df['month'] = df['purchase_date'].dt.floor('d') - pd.offsets.MonthBegin(1)
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

    df['month'] = (df['purchase_date'] - pd.offsets.MonthBegin(1)).dt.floor('d')
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
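
The sample data above contains no purchase on the 1st, so it does not show how the MonthBegin(1) subtraction behaves there. The sketch below uses a made-up `dates` series to illustrate it, together with one possible workaround (an assumption on my part, not something from the answer): roll forward to the month end with `pd.offsets.MonthEnd(0)` first, then subtract `MonthBegin(1)`.

```python
import pandas as pd

# Hypothetical sample: mid-month, exactly on the 1st, and on the last day.
dates = pd.Series(pd.to_datetime([
    '2015-03-21 17:07:30',
    '2015-03-01 09:00:00',
    '2015-03-31 23:59:59',
]))

# Approach from the answer: a date on the 1st rolls back a full month.
naive = dates.dt.floor('d') - pd.offsets.MonthBegin(1)

# Possible workaround: move to the month end first (a no-op if the date is
# already on it), so the subtraction always lands in the same month.
robust = dates.dt.floor('d') + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)

print(pd.DataFrame({'date': dates, 'naive': naive, 'robust': robust}))
```

The second row is the interesting one: `naive` ends up in February while `robust` stays in March. The extra offset addition costs some time, so the timings further down would not apply to this variant.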
    

    The last solution is to create a monthly period with to_period:

    df['month'] = df['purchase_date'].dt.to_period('M')
    print (df)
       user_id       purchase_date   month
    0        1 2015-01-23 14:05:21 2015-01
    1        2 2015-02-05 05:07:30 2015-02
    2        3 2015-02-18 17:08:51 2015-02
    3        4 2015-03-21 17:07:30 2015-03
    4        5 2015-03-11 18:32:56 2015-03
    5        6 2015-03-03 11:02:30 2015-03
    

    ... and then convert it back to datetimes with to_timestamp, but this is a bit slower:

    df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

    There are many solutions, so here are some timings:

    rng = pd.date_range('1980-04-03 15:41:12', periods=100000, freq='20H')
    df = pd.DataFrame({'purchase_date': rng})  
    print (df.head())
    
    In [300]: %timeit df['month1'] = df['purchase_date'].values.astype('datetime64[M]')
    100 loops, best of 3: 9.2 ms per loop
    
    In [301]: %timeit df['month2'] = df['purchase_date'].dt.floor('d') - pd.offsets.MonthBegin(1)
    100 loops, best of 3: 15.9 ms per loop
    
    In [302]: %timeit df['month3'] = (df['purchase_date'] - pd.offsets.MonthBegin(1)).dt.floor('d')
    100 loops, best of 3: 12.8 ms per loop
    
    In [303]: %timeit df['month4'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
    1 loop, best of 3: 399 ms per loop
    
    #MaxU solution
    In [304]: %timeit df['month5'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)
    10 loops, best of 3: 24.9 ms per loop
    
    #MaxU solution 2
    In [305]: %timeit df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)
    10 loops, best of 3: 28.9 ms per loop
    
    #Wen solution
    In [306]: %timeit df['month6']= pd.to_datetime(df.purchase_date.astype(str).str[0:7]+'-01')
    1 loop, best of 3: 214 ms per loop
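
Before trusting timings like these, it can help to confirm that the approaches return the same values. The following is a rough sanity check, not part of the original benchmark: it uses a smaller range (1000 rows), passes the frequency as `pd.Timedelta(hours=20)` instead of the `'20H'` string to sidestep alias differences between pandas versions, and compares only the NumPy cast, the period round trip and the string-slicing approach. The variable names `m1`, `m4` and `m6` simply mirror the column names used in the timings.

```python
import pandas as pd

# Smaller version of the benchmark data (assumption: 1000 rows is enough
# for a correctness check).
rng = pd.date_range('1980-04-03 15:41:12', periods=1000,
                    freq=pd.Timedelta(hours=20))
df = pd.DataFrame({'purchase_date': rng})

# NumPy cast: truncate to month precision, then back to nanoseconds.
m1 = df['purchase_date'].values.astype('datetime64[M]').astype('datetime64[ns]')

# Period round trip.
m4 = df['purchase_date'].dt.to_period('M').dt.to_timestamp()

# String slicing: keep 'YYYY-MM' and append '-01'.
m6 = pd.to_datetime(df['purchase_date'].astype(str).str[0:7] + '-01')

# Compare the underlying NumPy arrays so differing time units do not matter.
same_14 = bool((m1 == m4.values).all())
same_16 = bool((m1 == m6.values).all())
print(same_14, same_16)
```

The two MonthBegin(1) variants are left out on purpose, since they differ from the others for dates that fall on the first of a month.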
    