Why is pandas.to_datetime slow for non-standard time formats such as '2014/12/31'?

悲哀的现实 2020-12-01 03:15

I have a .csv file in the following format:

timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...

and when read via pd.re
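
For timestamps shaped like the sample rows above, one thing worth trying is passing an explicit `format` string so that pandas does not have to infer the layout for each value. The sketch below is a minimal illustration, not a benchmark. It assumes the four digits after the last colon are fractional seconds (so `:9200` means 0.92 s), which the post does not confirm, and the variable names `raw` and `FMT` are made up for the example.

```python
import pandas as pd

# Sample values copied from the rows shown above.
raw = pd.Series(["2014/12/31 00:31:01:9200", "2014/12/31 00:31:12:1700"])

# Assumption: the trailing digits are fractional seconds, separated by a
# colon instead of the usual dot, so %f is placed after a literal ':'.
FMT = "%Y/%m/%d %H:%M:%S:%f"

# With an explicit format, pandas parses every value against one fixed layout.
parsed = pd.to_datetime(raw, format=FMT)
```

If the trailing field means something else (for example, ten-thousandths stored as an integer counter), the format string would need to change accordingly.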

3 Answers
  •  长情又很酷
    2020-12-01 04:01

    Often I am unable to specify a standard date format ahead of time because I simply do not know how each client will choose to submit it. The dates are unpredictably formatted and often missing.

    In these cases, instead of using pd.to_datetime, I have found it more efficient to write my own wrapper around dateutil.parser.parse:

    import pandas as pd
    from dateutil.parser import parse
    import numpy as np
    
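    def parseDateStr(s):
        # Empty strings are treated as missing dates.
        if s == '':
            return np.datetime64('NaT')
        try:
            return np.datetime64(parse(s))
        except ValueError:
            # Unparseable input becomes NaT, mirroring errors='coerce'.
            return np.datetime64('NaT')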
    
    # Example data:
    someSeries = pd.Series(['NotADate', '', '1-APR-16'] * 10000)
    
    # Compare times:
    %timeit pd.to_datetime(someSeries, errors='coerce') #1 loop, best of 3: 1.78 s per loop
    %timeit someSeries.apply(parseDateStr)              #1 loop, best of 3: 904 ms per loop
    
    # The approaches return identical results:
    someSeries.apply(parseDateStr).equals(pd.to_datetime(someSeries, errors='coerce')) # True
    

    In this case the runtime is roughly halved (1.78 s down to 904 ms), but YMMV.
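
When the same strings repeat many times, as they do in the example series above, a further variation on the wrapper idea is to parse each distinct string only once and map the results back. The following is a sketch under that assumption rather than part of the original answer. The helper names `parse_or_nat` and `parse_unique` are hypothetical, and whether it helps depends on how many values actually repeat.

```python
import pandas as pd
from dateutil.parser import parse


def parse_or_nat(s):
    """Hypothetical helper: parse one value, returning NaT when it is not a usable date string."""
    # Non-strings (e.g. NaN) and empty strings are treated as missing.
    if not isinstance(s, str) or s == "":
        return pd.NaT
    try:
        return pd.Timestamp(parse(s))
    except (ValueError, OverflowError):
        # Unparseable input becomes NaT, similar in spirit to errors='coerce'.
        return pd.NaT


def parse_unique(series):
    """Parse each distinct value once, then map the results onto the full series."""
    lookup = {value: parse_or_nat(value) for value in series.unique()}
    # pd.to_datetime makes sure the mapped result has a datetime dtype.
    return pd.to_datetime(series.map(lookup))


some_series = pd.Series(["NotADate", "", "1-APR-16"] * 10000)
result = parse_unique(some_series)
```

Here dateutil is called only three times instead of 30,000, since the series holds just three distinct strings. Data with mostly unique timestamps would see little benefit.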
