Why is pandas.to_datetime slow for a non-standard time format such as '2014/12/31'?

Asked by 悲哀的现实 on 2020-12-01 03:15

I have a .csv file in the following format:

timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...

and when I read it via pd.read_csv and parse the timestmp strings with pd.to_datetime, the conversion is very slow.
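
A minimal sketch of what explicit parsing could look like for this format, assuming the trailing colon-separated field is fractional seconds (the answers below explain why an explicit format helps):

import pandas as pd
from io import StringIO

csv_data = """timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
"""

df = pd.read_csv(StringIO(csv_data), skipinitialspace=True)

# Turn the non-standard ':9200' fractional field into '.9200' so the
# string matches a strftime pattern, then parse with an explicit format.
ts = df['timestmp'].str.replace(r':(\d+)$', r'.\1', regex=True)
df['timestmp'] = pd.to_datetime(ts, format='%Y/%m/%d %H:%M:%S.%f')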

3 Answers

  •  [愿得一人] 2020-12-01 04:00

    This question has already been sufficiently answered, but I wanted to add the results of some tests I ran while optimizing my own code.

    I was getting this format from an API: "Wed Feb 08 17:58:56 +0000 2017".

    Using the default pd.to_datetime(SERIES) with an implicit conversion, it was taking over an hour to process roughly 20 million rows (depending on how much free memory I had).

    That said, I tested three different explicit conversions against the implicit baseline:

    import pandas as pd

    # Input strings look like "Wed Feb 08 17:58:56 +0000 2017"; split() yields
    # [weekday, month, day, time, utc_offset, year].

    # explicit conversion of essential information only -- parse dt str: concat
    def format_datetime_1(dt_series):

        def get_split_date(strdt):
            split_date = strdt.split()
            # keep month, day, year, time; drop the weekday and UTC offset
            str_date = split_date[1] + ' ' + split_date[2] + ' ' + split_date[5] + ' ' + split_date[3]
            return str_date

        return pd.to_datetime(dt_series.apply(get_split_date), format='%b %d %Y %H:%M:%S')

    # explicit conversion to %c, the locale's "appropriate date and time
    # representation" -- parse dt str: del then join
    def format_datetime_2(dt_series):

        def get_split_date(strdt):
            split_date = strdt.split()
            del split_date[4]  # drop the UTC offset, leaving the %c layout
            return ' '.join(split_date)

        return pd.to_datetime(dt_series.apply(get_split_date), format='%c')

    # same %c target, built by concatenation instead of delete-and-join
    def format_datetime_3(dt_series):

        def get_split_date(strdt):
            split_date = strdt.split()
            str_date = split_date[0] + ' ' + split_date[1] + ' ' + split_date[2] + ' ' + split_date[3] + ' ' + split_date[5]
            return str_date

        return pd.to_datetime(dt_series.apply(get_split_date), format='%c')

    # implicit conversion (baseline): pandas infers the format itself
    def format_datetime_baseline(dt_series):
        return pd.to_datetime(dt_series)

    These were the results:

    # sample of 250k rows
    dt_series_sample = df['created_at'][:250000]
    
    %timeit format_datetime_1(dt_series_sample)        # best of 3: 1.56 s per loop
    %timeit format_datetime_2(dt_series_sample)        # best of 3: 2.09 s per loop
    %timeit format_datetime_3(dt_series_sample)        # best of 3: 1.72 s per loop
    %timeit format_datetime_baseline(dt_series_sample) # best of 3: 1min 9s per loop
    

    The first test results in an impressive 97.7% runtime reduction!

    Somewhat surprisingly, even the locale's "appropriate date and time representation" (%c) takes longer than the minimal explicit format, probably because %c is still semi-implicit.

    Conclusion: the more explicit you are, the faster it will run.
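
    As a footnote to the conclusion: a hedged alternative, assuming a pandas version whose strptime support includes the %z directive, is to hand the API string to pd.to_datetime directly and skip the per-row .apply entirely. The format string below is my own mapping of "Wed Feb 08 17:58:56 +0000 2017", not something from the answer above:

    import pandas as pd

    # '%a %b %d %H:%M:%S %z %Y' matches "Wed Feb 08 17:58:56 +0000 2017"
    # token for token; %z consumes the UTC offset, so no string surgery or
    # Python-level loop is needed. The result is timezone-aware.
    def format_datetime_direct(dt_series):
        return pd.to_datetime(dt_series, format='%a %b %d %H:%M:%S %z %Y')

    Whether this beats the concat approach depends on the pandas version, so treat it as an option to benchmark rather than a guaranteed win.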
