Slow pd.to_datetime()

Question

I am reading two types of CSV files that are very similar. They are about the same length, 20,000 lines. Each line represents parameters recorded each second, so the first column is the timestamp.

In the first file, the pattern is the following: 2018-09-24 15:38
In the second file, the pattern is the following: 2018-09-24 03:38:06 PM

In both cases, the command is the same:

data = pd.read_csv(file)
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

I checked the execution time of both lines:

pd.read_csv takes about the same time in both cases
pd.to_datetime takes ~3 to 4 seconds longer on the second file

The only difference is the date pattern. I would not have suspected that. Do you know why? Do you know how to fix this?

Answer 1:

pandas.to_datetime is extremely slow (in certain instances) when it needs to parse the dates automatically. Since it seems like you know the formats, you should explicitly pass them to the format parameter, which will greatly improve the speed.

Here's an example:

import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})

%timeit pd.to_datetime(df1.Timestamp)
#21 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(df2.Timestamp)
#14.3 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's roughly 700x slower (14.3 s versus 21 ms). Now specify the format explicitly:

%timeit pd.to_datetime(df2.Timestamp, format='%Y-%m-%d %I:%M:%S %p')
#384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas is still parsing the second date format more slowly, but it's not nearly as bad as it was before.
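
Applied to the original read-then-convert workflow, this could look like the sketch below. The in-memory CSV strings, the Value column, and the load helper are hypothetical stand-ins for the real files, and the 24-hour format string assumes the first file includes seconds, as in the timing example above; adjust it if the column really stops at minutes.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (illustrative data only).
csv_24h = "Timestamp,Value\n2018-09-24 15:38:06,1\n2018-09-24 15:38:07,2\n"
csv_12h = "Timestamp,Value\n2018-09-24 03:38:06 PM,1\n2018-09-24 03:38:07 PM,2\n"

# Assumed formats: %H is the 24-hour clock, %I with %p is the 12-hour clock.
FORMAT_24H = "%Y-%m-%d %H:%M:%S"
FORMAT_12H = "%Y-%m-%d %I:%M:%S %p"


def load(source, fmt):
    """Read a CSV and convert its Timestamp column using an explicit format."""
    data = pd.read_csv(source)
    # Passing format= skips pandas' automatic format detection.
    data["Timestamp"] = pd.to_datetime(data["Timestamp"], format=fmt)
    return data


data_24h = load(io.StringIO(csv_24h), FORMAT_24H)
data_12h = load(io.StringIO(csv_12h), FORMAT_12H)

# Both columns now hold the same datetime64 values.
print(data_24h["Timestamp"].equals(data_12h["Timestamp"]))
```

With real files, the io.StringIO objects would be replaced by the file paths. Keeping one format string per file type also makes a mismatched file fail loudly instead of being parsed slowly or incorrectly.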

Source: https://stackoverflow.com/questions/52480839/slow-pd-to-datetime

Tags

python

pandas

string-to-datetime