I have some log parsing code that needs to turn a timestamp into a datetime object. I am using datetime.strptime but this function is using a lot of cputime according to cPr
If those are fixed-width formats, then there is no need to parse the line: you can use slicing and a dictionary lookup to get the fields directly.
month_abbreviations = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
year = int(line[7:11])
month = month_abbreviations[line[3:6]]
day = int(line[0:2])
hour = int(line[12:14])
minute = int(line[15:17])
second = int(line[18:20])
new_entry['time'] = datetime.datetime(year, month, day, hour, minute, second)
Testing in the manner shown by Glenn Maynard shows this to be about 3 times faster.
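For reference, the slicing approach can be wrapped in a small helper and checked against strptime(). This is only a sketch: the name parse_fixed_width is hypothetical, and it assumes the timestamp sits at the very start of the string in the form "01/Nov/2010:07:49:33", which is the sample format used elsewhere in this thread.

```python
import datetime

MONTH_ABBREVIATIONS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}


def parse_fixed_width(line):
    """Hypothetical helper: parse 'DD/Mon/YYYY:HH:MM:SS' by slicing.

    Assumes the timestamp starts at offset 0 and every field has a
    fixed width; anything else raises ValueError or KeyError.
    """
    return datetime.datetime(
        int(line[7:11]),                  # year
        MONTH_ABBREVIATIONS[line[3:6]],   # month, via dictionary lookup
        int(line[0:2]),                   # day
        int(line[12:14]),                 # hour
        int(line[15:17]),                 # minute
        int(line[18:20]),                 # second
    )


# Sanity check: the sliced result should match what strptime() produces.
sample = "01/Nov/2010:07:49:33"
assert parse_fixed_width(sample) == datetime.datetime.strptime(
    sample, "%d/%b/%Y:%H:%M:%S")
```

The trade-off is that this does no validation beyond what int() and the dictionary lookup provide, so a malformed line fails with a less descriptive error than strptime() would give.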
It seems that strptime() on a Windows platform uses a Python implementation (_strptime.py in the Lib directory) and not a C one. It might be quicker to process the string yourself.
from datetime import datetime
import timeit
def f():
    datetime.strptime("2010-11-01", "%Y-%m-%d")

n = 100000
# Average time per call, in seconds
print("%.6f" % (timeit.timeit(f, number=n) / n))
prints 0.000049 (about 49 microseconds per call) on my system, whereas
from datetime import date
import timeit
def f():
    parts = [int(x) for x in "2010-11-01".split("-")]
    return date(parts[0], parts[1], parts[2])

n = 100000
# Average time per call, in seconds
print("%.6f" % (timeit.timeit(f, number=n) / n))
prints 0.000009 (about 9 microseconds per call).
Most recent answer: if moving to a straight strptime() has not improved the running time, then my suspicion is that there is actually no problem here: you have simply written a program, one of whose main purposes in life is to call strptime() very many times, and you have written it well enough (with so little other stuff that it does) that the strptime() calls are quite properly being allowed to dominate the runtime. I think you could count this as a success rather than a failure, unless you find that (a) some Unicode or LANG setting is making strptime() do extra work, or (b) you are calling it more often than you need to. Try, of course, to call it only once for each date to be parsed. :-)
Follow-up answer after seeing example date string: Wait! Hold on! Why are you parsing the line instead of just using a formatting string like:
"%d/%b/%Y:%H:%M:%S"
Original off-the-cuff answer: If the month were an integer you could do something like this:
new_entry['time'] = datetime.datetime(
    int(parsed_line['year']),
    int(parsed_line['month']),
    int(parsed_line['day']),
    int(parsed_line['hour']),
    int(parsed_line['minute']),
    int(parsed_line['second'])
)
and avoid creating a big string just to make strptime() split it back apart again. I wonder if there is a way to access the month-name logic directly to do that one textual conversion?
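On the month-name question, one possibility is the standard library's calendar.month_abbr, which lists the abbreviated month names for the current locale. The sketch below builds a lookup table from it; MONTH_NUMBERS and build_datetime are hypothetical names, and parsed_line is assumed to be a dictionary of strings whose 'month' entry is an abbreviation such as 'Nov'.

```python
import calendar
import datetime

# calendar.month_abbr[0] is an empty string, so skip it. The names follow
# the current locale, which is English unless setlocale() has changed it.
MONTH_NUMBERS = dict(
    (name, number)
    for number, name in enumerate(calendar.month_abbr)
    if name
)


def build_datetime(parsed_line):
    """Hypothetical helper: build a datetime from already-split fields."""
    return datetime.datetime(
        int(parsed_line['year']),
        MONTH_NUMBERS[parsed_line['month']],  # textual month -> integer
        int(parsed_line['day']),
        int(parsed_line['hour']),
        int(parsed_line['minute']),
        int(parsed_line['second']),
    )


example = {'year': '2010', 'month': 'Nov', 'day': '01',
           'hour': '07', 'minute': '49', 'second': '33'}
new_entry = {'time': build_datetime(example)}
```

Building the table once at import time keeps the per-line cost to a single dictionary lookup.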
What's a "lot of time"? strptime is taking about 30 microseconds here:
from datetime import datetime
import timeit
def f():
    datetime.strptime("01/Nov/2010:07:49:33", "%d/%b/%Y:%H:%M:%S")

n = 100000
# Average time per call, in seconds
print("%.6f" % (timeit.timeit(f, number=n) / n))
prints 0.000031.