Inferring date format versus passing a parser

荒凉一梦 提交于 2020-01-02 00:55:24

问题


Pandas internals question: I've been surprised to find a few times that explicitly passing a callable to date_parser within pandas.read_csv results in much slower read time than simply using infer_datetime_format=True.

Why is this? Will timing differences between these two options be date-format-specific, or what other factors will influence their relative timing?

In the below case, infer_datetime_format=True takes one-tenth the time of passing a date parser with a specified format. I would have naively assumed the latter would be faster because it's explicit.

The docs do note,

[if True,] pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

but there's not much detail given and I was unable to work my way fully through the source.

Setup:

from io import StringIO

import numpy as np
import pandas as pd

np.random.seed(444)
dates = pd.date_range('1980', '2018')
df = pd.DataFrame(np.random.randint(0, 100, (len(dates), 2)),
                  index=dates).add_prefix('col').reset_index()

# Something reproducible to be read back in
buf = StringIO()
df.to_string(buf=buf, index=False)

def read_test(**kwargs):
    # Not ideal for .seek() to eat up runtime, but alleviate
    # this with more loops than needed in timing below
    buf.seek(0)
    return pd.read_csv(buf, sep='\s+', parse_dates=['index'], **kwargs)

# dateutil.parser.parser called in this case, according to docs
%timeit -r 7 -n 100 read_test()
18.1 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -r 7 -n 100 read_test(infer_datetime_format=True)
19.8 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Doesn't change with native Python datetime.strptime either
%timeit -r 7 -n 100 read_test(date_parser=lambda dt: pd.datetime.strptime(dt, '%Y-%m-%d'))
187 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm interested in knowing a bit about what is going on internally with infer to give it this advantage. My old understanding was that there was already some type of inference going on in the first place because dateutil.parser.parser is used if neither is passed.


Update: did some digging on this but haven't been able to answer the question.

read_csv() calls a helper function which in turn calls pd.core.tools.datetimes.to_datetime(). That function (accessible as just pd.to_datetime()) has both an infer_datetime_format and a format argument.

However, in this case, the relative timings are very different and don't reflect the above:

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000']*1000)

%timeit pd.to_datetime(s,infer_datetime_format=True)
19.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(s,infer_datetime_format=False)
1.01 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This was taking the longest with i/o functions,
# now it's behaving "as expected"
%timeit pd.to_datetime(s,format='%m/%d/%Y')
19 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

回答1:


You've identified the two important functions: read_csv prepares a function to parse the date columns using _make_date_converter, and this is always going make a call to_datetime (pandas' primary string-to-date conversion tool).

The answers by @WillAyd and @bmbigbang both seem correct to me in that they identify the cause of the slowness as repeated calls of the lambda function.

Since you ask for more details about pandas' source code, I'll try and examine each read_test call in more detail below to find out where we end up in to_datetime and ultimately why the timings are as you observed for your data.


read_test()

This is very fast because, without any prompts about a possible date format, pandas is going to try and parse the list-like column of strings as though they're approximately in ISO8601 format (which is a very common case).

Plunging into to_datetime, we quickly reach this code branch:

if result is None and (format is None or infer_datetime_format):
    result = tslib.array_to_datetime(...)

From here on, it's compiled Cython code all the way.

array_to_datetime iterates through the column of strings to convert each one to datetime format. For each row, we hit _string_to_dts at this line; then we go to another short snippet of inlined code (_cstring_to_dts) which means parse_iso_8601_datetime is called to do the actual parsing of the string to a datetime object.

This function is more than capable of parsing dates in the YYYY-MM-DD format and so there is just some housekeeping to finish the job (the C struct filled by parse_iso_8601_datetime becomes a proper datetime object, some bounds are checked).

As you can see, dateutil.parser.parser is not called at all.


read_test(infer_datetime_format=True)

Let's see why this is almost as fast as read_test().

Asking pandas to infer the datetime format (and passing no format argument) means we land here in to_datetime:

if infer_datetime_format and format is None:
    format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

This calls _guess_datetime_format_for_array, which takes the first non-null value in the column and gives it to _guess_datetime_format. This tries to build a datetime format string to use for future parsing. (My answer here has more detail above the formats it is able recognise.)

Fortunately, the YYYY-MM-DD format is one that can be recognised by this function. Even more fortunately, this particular format has a fast-path through the pandas code!

You can see pandas sets infer_datetime_format back to False here:

if format is not None:
    # There is a special fast-path for iso8601 formatted
    # datetime strings, so in those cases don't use the inferred
    # format because this path makes process slower in this
    # special case
    format_is_iso8601 = _format_is_iso(format)
    if format_is_iso8601:
        require_iso8601 = not infer_datetime_format
        format = None

This allows the code to take the same path as above to the parse_iso_8601_datetime function.


read_test(date_parser=lambda dt: strptime(dt, '%Y-%m-%d'))

We've provided a function to parse the date with, so pandas executes this code block.

However, this raises as exception internally:

strptime() argument 1 must be str, not numpy.ndarray

This exception is immediately caught, and pandas falls back to using try_parse_dates before calling to_datetime.

try_parse_dates means that instead of being called on an array, the lambda function is called repeatedly for each value of the array in this loop:

for i from 0 <= i < n:
    if values[i] == '':
        result[i] = np.nan
    else:
        result[i] = parse_date(values[i]) # parse_date is the lambda function

Despite being compiled code, we pay the penalty of having function calls to Python code. This makes it very slow in comparison to the other approaches above.

Back in to_datetime, we now have an object array filled with datetime objects. Again we hit array_to_datetime, but this time pandas sees a date object and uses another function (pydate_to_dt64) to make it into a datetime64 object.

The cause of the slowdown is really due to the repeated calls to the lambda function.


About your update and the MM/DD/YYYY format

The Series s has date strings in the MM/DD/YYYY format.

This is not an ISO8601 format. pd.to_datetime(s, infer_datetime_format=False) tries to parse the string using parse_iso_8601_datetime but this fails with a ValueError. The error is handled here: pandas is going to use parse_datetime_string instead. This means that dateutil.parser.parse is used to convert the string to datetime. This is why it is slow in this case: repeated use of a Python function in a loop.

There's not much difference between pd.to_datetime(s, format='%m/%d/%Y') and pd.to_datetime(s, infer_datetime_format=True) in terms of speed. The latter uses _guess_datetime_format_for_array again to infer the MM/DD/YYYY format. Both then hit array_strptime here:

if format is not None:
    ...
    if result is None:
        try:
            result = array_strptime(arg, format, exact=exact, errors=errors)

array_strptime is a fast Cython function for parsing an array of strings to datetime structs given a specific format.




回答2:


infer_datetime_format applies the format it deduces to all elements in a single pass (i.e. in a vectorized manner). By contrast, your lambda function gets called for each element, bringing with it a lot more call overhead and decreased performance.

There is a request from 2012 for an explicit format keyword argument, which would theoretically give you the functionality you are looking for with the performance you desire. In lieu of that being implemented, your best bet is to go with Wes's suggested approach in the link and simply read the date in as a string, calling pd.to_datetime after the fact

Here are sample times from my machine to illustrate:

%timeit read_test()
15.4 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit read_test(infer_datetime_format=True)
17.2 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit read_test(date_parser=lambda dt: pd.datetime.strptime(dt, '%Y-%m-%d'))
147 ms ± 4.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df = read_test(); df['index'] = pd.to_datetime(df['index'], '%Y-%m-%d')
15.3 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



回答3:


After paying closer attention to Pandas original pull request for infer format

I believe the passing of the parser directly is not compatible with pandas' ._convert_listlike method. Whereas dateutil's parser is not inherently parallelizable, pandas's convert list like operations can action the parsing in parallel if they can infer the format. The post mentions the format is inferred from the first item, then all others get the same treatment. Being a mathematician, I would probably recommend taking 10 random entries and taking the majority parsed format as the picked one.

Update As mentioned in the comments, and within the pull request, there are also comparisons between passing the format or the parser: test gist. It might be worth a feature request to let the parser be a np.vectorize object or similar.



来源:https://stackoverflow.com/questions/48728307/inferring-date-format-versus-passing-a-parser

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!