Why is pandas.to_datetime slow for non-standard time formats such as '2014/12/31'?

悲哀的现实 2020-12-01 03:15

I have a .csv file in the following format:

timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...

and when read via pd.re

3 answers
  •  南方客 (original poster)
    2020-12-01 04:15

    This is because pandas falls back to dateutil.parser.parse when the strings are in a non-default format or when no format string is supplied. That parser is much more flexible, but also slower.

    As you have shown above, you can improve performance by supplying a format string to to_datetime. Another option is to pass infer_datetime_format=True.


    Apparently, infer_datetime_format cannot infer the format when the strings contain microseconds. With an example that has none, you can see a large speed-up:

    In [27]: import re; import pandas as pd
    
    In [28]: d = '2014-12-24 01:02:03'
    
    In [29]: c = re.sub('-', '/', d)
    
    In [30]: s_c = pd.Series([c]*10000)
    
    In [31]: %timeit pd.to_datetime(s_c)
    1 loops, best of 3: 1.14 s per loop
    
    In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True)
    10 loops, best of 3: 105 ms per loop
    
    In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S")
    10 loops, best of 3: 99.5 ms per loop
    
