I have a .csv file in such format
timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...
and when read via pd.re
Often I am unable to specify a standard date format ahead of time because I simply do not know how each client will choose to submit it. The dates are unpredictably formatted and often missing.
In these cases, instead of using pd.to_datetime, I have found it more efficient to write my own wrapper to dateutil.parser.parse:
import pandas as pd
from dateutil.parser import parse
import numpy as np
def parseDateStr(s):
if s != '':
try:
return np.datetime64(parse(s))
except ValueError:
return np.datetime64('NaT')
else: return np.datetime64('NaT')
# Example data:
someSeries=pd.Series( ['NotADate','','1-APR-16']*10000 )
# Compare times:
%timeit pd.to_datetime(someSeries, errors='coerce') #1 loop, best of 3: 1.78 s per loop
%timeit someSeries.apply(parseDateStr) #1 loop, best of 3: 904 ms per loop
# The approaches return identical results:
someSeries.apply(parseDateStr).equals(pd.to_datetime(someSeries, errors='coerce')) # True
In this case the runtime is cut in half, but YMMV.