Slow pd.to_datetime()

Submitted by 十年热恋 on 2020-03-22 19:41:34

Question


I am reading two types of CSV files that are very similar. They are about the same length, roughly 20,000 lines. Each line holds parameters recorded once per second, so the first column is the timestamp.

  • In the first file, the pattern is the following: 2018-09-24 15:38
  • In the second file, the pattern is the following: 2018-09-24 03:38:06 PM

In both cases, the code is the same:

import pandas as pd

data = pd.read_csv(file)  # file is the path to either CSV
data['Timestamp'] = pd.to_datetime(data['Timestamp'])  # no format given

I timed the two steps separately:

  • pd.read_csv takes about the same time for both files
  • pd.to_datetime takes ~3 to 4 seconds longer on the second file

The only difference is the date pattern. I would not have suspected that. Do you know why? Do you know how to fix this?
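
To reproduce this kind of measurement outside IPython, the two steps can be timed separately with time.perf_counter. This is only a minimal sketch: the in-memory CSV, the Value column, and the row count are made up for illustration and stand in for the real files.

```python
import io
import time

import pandas as pd

# Hypothetical in-memory stand-in for the second CSV file (12-hour timestamps).
rows = ["Timestamp,Value"] + ["2018-09-24 03:38:06 PM,1.0"] * 1000
csv_text = "\n".join(rows)

# Time the read step on its own.
start = time.perf_counter()
data = pd.read_csv(io.StringIO(csv_text))
read_seconds = time.perf_counter() - start

# Time the conversion step on its own (no explicit format, as in the question).
start = time.perf_counter()
data["Timestamp"] = pd.to_datetime(data["Timestamp"])
parse_seconds = time.perf_counter() - start

print(f"read_csv:    {read_seconds:.4f} s")
print(f"to_datetime: {parse_seconds:.4f} s")
```

The absolute numbers depend on the machine and the pandas version, so only the comparison between the two files is meaningful.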


Answer 1:


pandas.to_datetime can be extremely slow when it has to work out the date format automatically, and some formats are hit much harder than others. Since you know the format of each file, pass it explicitly through the format parameter, which greatly improves the speed.

Here's an example:

import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})

%timeit pd.to_datetime(df1.Timestamp)
#21 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(df2.Timestamp)
#14.3 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's roughly 700x slower (14.3 s versus 21 ms). Now specify the format explicitly:

%timeit pd.to_datetime(df2.Timestamp, format='%Y-%m-%d %I:%M:%S %p')
#384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas is still parsing the second date format more slowly, but it's not nearly as bad as it was before.
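
Applied to the two files from the question, the same idea might look like the sketch below. The helper parse_timestamps and the constant names are hypothetical, and the 24-hour format string assumes the first file includes seconds, as df1 does above; if its timestamps really look like 2018-09-24 15:38, '%Y-%m-%d %H:%M' would be the format to use instead.

```python
import pandas as pd

# Assumed format strings for the two timestamp patterns (not part of pandas).
FORMAT_24H = "%Y-%m-%d %H:%M:%S"      # e.g. 2018-09-24 15:38:06
FORMAT_12H = "%Y-%m-%d %I:%M:%S %p"   # e.g. 2018-09-24 03:38:06 PM


def parse_timestamps(column: pd.Series, fmt: str) -> pd.Series:
    """Hypothetical helper: parse a string column with a known, fixed format."""
    return pd.to_datetime(column, format=fmt)


# Small samples standing in for the Timestamp column of each file.
sample_24h = pd.Series(["2018-09-24 15:38:06", "2018-09-24 15:38:07"])
sample_12h = pd.Series(["2018-09-24 03:38:06 PM", "2018-09-24 03:38:07 PM"])

parsed_24h = parse_timestamps(sample_24h, FORMAT_24H)
parsed_12h = parse_timestamps(sample_12h, FORMAT_12H)

# Both columns should decode to the same instants.
print(parsed_24h.equals(parsed_12h))
```

With an explicit format, a value that does not match raises an error by default instead of being silently parsed some other way, which also makes a mixed-up file easier to spot.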



Source: https://stackoverflow.com/questions/52480839/slow-pd-to-datetime
