Question
I am reading a big CSV file which has about 1B rows. I ran into an issue with parsing the date: Python is slow at processing it.
A single line in the file looks like the following:
'20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n'
If I only iterate over the lines without processing them, it takes about 1 minute.
import datetime

t0 = datetime.datetime.now()
i = 0
with open(r"QuoteData.txt") as file:
    for line in file:
        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)
129908976
0:01:09.871744
But if I also parse the datetime, it takes about 8 minutes.
import datetime

t0 = datetime.datetime.now()
i = 0
with open(r"D:\FxQuotes\ticks.log.20170427.txt") as file:
    for line in file:
        strings = line.split(",")
        datetime.datetime(
            int(strings[0][0:4]),  # %Y
            int(strings[0][4:6]),  # %m
            int(strings[0][6:8]),  # %d
            int(strings[1][0:2]),  # %H
            int(strings[1][3:5]),  # %M
            int(strings[1][6:8]),  # %S
            # %f: the sample shows a 3-digit millisecond field, but datetime
            # expects microseconds, so scale it (assumes always 3 digits)
            int(strings[1][9:]) * 1000,
        )
        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)
129908976
0:08:13.687000
The split() call takes about 1 minute, and the date parsing takes about 6 minutes. Is there anything I can do to improve this?
Answer 1:
@TemporalWolf had the excellent suggestion of using ciso8601. I had never heard of it, so I figured I'd give it a try.
First, I benchmarked my laptop with your sample line. I made a CSV file with 10 million rows of that exact line, and it took about 6 seconds to read everything. Adding your date parsing code brought that up to 48 seconds, which made sense because you also reported a slowdown of roughly 8x. Then I scaled the file down to 1 million rows: reading it took 0.6 seconds, and reading plus date parsing took 4.8 seconds, so everything looked consistent.
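The benchmark setup described above can be reproduced roughly as follows. This is only a sketch: the helper names `write_sample_file` and `time_read` and the file name `input_sample.csv` are made up for illustration, and the timings will differ from machine to machine.

```python
import os
import tempfile
import time

# The sample line from the question, repeated to build a test file.
SAMPLE_LINE = (
    "20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,"
    "9@15@8653948257753368229,0.0\n"
)


def write_sample_file(path, n_rows):
    """Write n_rows copies of the sample line to path (hypothetical helper)."""
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write(SAMPLE_LINE)


def time_read(path):
    """Return (row_count, elapsed_seconds) for a plain line-by-line read."""
    start = time.perf_counter()
    count = 0
    with open(path) as f:
        for _ in f:
            count += 1
    return count, time.perf_counter() - start


if __name__ == "__main__":
    # 100,000 rows keeps the demo quick; raise it to 1_000_000 or
    # 10_000_000 to match the file sizes discussed above.
    path = os.path.join(tempfile.gettempdir(), "input_sample.csv")
    write_sample_file(path, 100_000)
    rows, seconds = time_read(path)
    print(rows, "rows read in", round(seconds, 2), "seconds")
    os.remove(path)
```

Measuring the plain read first gives a baseline, so the cost of the date parsing can be isolated by subtracting it from the later timings.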
Then I switched over to ciso8601 and, almost like magic, the time for 1 million rows went from 4.8 seconds to about 1.9 seconds:
import datetime
import ciso8601

t0 = datetime.datetime.now()
i = 0
with open('input.csv') as file:
    for line in file:
        strings = line.split(",")
        d = ciso8601.parse_datetime('%sT%s' % (strings[0], strings[1]))
        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)
Note that your data is almost in ISO 8601 format already. I just had to join the date and time fields with a "T" in the middle.
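If installing ciso8601 is not an option, the same "glue the two fields into an ISO 8601 string" idea can be tried with the standard library. The sketch below swaps in `datetime.datetime.fromisoformat` (Python 3.7+), which is a different parser from the one benchmarked above; I have not timed it, so treat any speed-up as something to measure yourself. The helper name `parse_timestamp` is made up for illustration, and the code assumes the date field is always `YYYYMMDD` and the time field always `HH:MM:SS.fff`, as in the sample line.

```python
import datetime


def parse_timestamp(date_field, time_field):
    """Combine 'YYYYMMDD' and 'HH:MM:SS.fff' fields into a datetime."""
    # Insert dashes into the date: before Python 3.11, fromisoformat()
    # only accepts the extended form YYYY-MM-DD, not the basic YYYYMMDD.
    iso = "%s-%s-%sT%s" % (
        date_field[:4],
        date_field[4:6],
        date_field[6:8],
        time_field,
    )
    return datetime.datetime.fromisoformat(iso)


line = (
    "20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,"
    "9@15@8653948257753368229,0.0\n"
)
fields = line.split(",")
print(parse_timestamp(fields[0], fields[1]))  # 2017-04-27 20:52:01.510000
```

Here the fractional part ".510" is read as 510 milliseconds, so the resulting `microsecond` value is 510000.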
Source: https://stackoverflow.com/questions/43726661/is-there-a-way-to-improve-speed-of-parsing-date-for-large-file