Select DataFrame rows between two dates

挽巷 2020-11-22 03:14

I am creating a DataFrame from a csv as follows:

stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)

The DataFra

10 Answers
  •  执念已碎
    2020-11-22 03:53

    In my testing with pandas 0.22.0, you can now solve this more easily, and with more readable code, by simply using between.

    import pandas as pd

    # create a single-column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
    df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})
    

    Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:

    # use the between statement to get a boolean mask
    df['dates'].between('2018-11-27','2019-01-15', inclusive=False)
    
    0    False
    1    False
    2    False
    3    False
    4    False
    
    # you can pass this boolean mask straight to loc
    df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]
    
        dates
    331 2018-11-28
    332 2018-11-29
    333 2018-11-30
    334 2018-12-01
    335 2018-12-02
    

    Notice the inclusive argument. It is very helpful when you want to be explicit about the boundaries of your range. When it is set to True, Nov 27th 2018 is returned as well:

    df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
    
        dates
    330 2018-11-27
    331 2018-11-28
    332 2018-11-29
    333 2018-11-30
    334 2018-12-01
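
    The boolean form of inclusive shown above matches the pandas 0.22.0 I tested with. Newer releases (1.3 and later, as far as I know) take the strings 'both', 'neither', 'left' and 'right' instead, and recent versions reject True/False. A minimal sketch of the same selections, assuming one of those newer versions is installed:

```python
import pandas as pd

# Same single-column frame as above: Jan 1st 2018 through Jan 1st 2019
df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

# Assumes pandas >= 1.3, where inclusive takes a string instead of a bool
open_mask = df['dates'].between('2018-11-27', '2019-01-15', inclusive='neither')
closed_mask = df['dates'].between('2018-11-27', '2019-01-15', inclusive='both')

exclusive_rows = df.loc[open_mask]    # starts at 2018-11-28 (index 331)
inclusive_rows = df.loc[closed_mask]  # starts at 2018-11-27 (index 330)

print(exclusive_rows.head())
print(inclusive_rows.head())
```

    'left' and 'right' include only one of the two boundaries, which the old True/False flag could not express.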
    

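    In my timings this method was also faster than the previously mentioned isin method (note that the isin call below checks against the full-year range rather than the Nov 27th to Jan 15th window):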

    %%timeit -n 5
    df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
    868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
    
    
    %%timeit -n 5
    
    df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
    1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
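
    For a like-for-like check of the two approaches, here is a small sketch that runs isin over the same Nov 27th to Jan 15th window and confirms it selects the same rows as an inclusive range test. It uses plain comparison operators so it does not depend on the pandas version, and the variable names are only illustrative. It also shows one caveat: isin matches exact timestamps only, so a value with a time-of-day component is missed.

```python
import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

start = pd.Timestamp('2018-11-27')
end = pd.Timestamp('2019-01-15')

# isin against a daily range covering the same window
isin_rows = df.loc[df['dates'].isin(pd.date_range(start, end))]

# inclusive range test on both ends
range_rows = df.loc[(df['dates'] >= start) & (df['dates'] <= end)]

same_rows = isin_rows.equals(range_rows)
print(same_rows)

# Caveat: a timestamp at noon lies inside the window but is not in the daily range
noon = pd.Series([pd.Timestamp('2018-11-27 12:00')])
noon_in_isin = bool(noon.isin(pd.date_range(start, end)).iloc[0])
noon_in_range = bool(((noon >= start) & (noon <= end)).iloc[0])
print(noon_in_isin, noon_in_range)
```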
    

    However, it is not faster than the currently accepted answer, provided by unutbu, but only when the mask has already been created. If the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:

    # create the mask first, THEN time the selection
    import datetime as dt

    start_date = dt.datetime(2018,11,27)
    end_date = dt.datetime(2019,1,15)
    # note: this mask excludes start_date and includes end_date
    mask = (df['dates'] > start_date) & (df['dates'] <= end_date)
    
    %%timeit -n 5
    df.loc[mask]
    191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
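
    The timing above excludes the cost of building the mask. To measure the dynamic case, where the mask is rebuilt on every call, something like the following sketch could be used. The helper name rows_between is hypothetical, and it uses the standard timeit module instead of the notebook magic; no timings are claimed here, since they depend on the machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})


def rows_between(frame, column, start, end):
    """Rebuild the boolean mask on every call, then select.

    Uses the same bounds as the mask above: start excluded, end included.
    """
    mask = (frame[column] > start) & (frame[column] <= end)
    return frame.loc[mask]


start_date = pd.Timestamp('2018-11-27')
end_date = pd.Timestamp('2019-01-15')

result = rows_between(df, 'dates', start_date, end_date)

# Time mask construction plus selection together
runs = 100
elapsed = timeit.timeit(
    lambda: rows_between(df, 'dates', start_date, end_date), number=runs
)
print(f"{elapsed / runs * 1e6:.0f} microseconds per call (mask rebuilt each time)")
```

    Timing the between call the same way gives a fair comparison, since both then pay for building the mask.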
    
