Numpy vectorization messes up data type (2)

小蘑菇 2020-12-21 06:22

I'm seeing unwanted behaviour from np.vectorize: it changes the data type of the argument passed to the original function. My original question

3 answers
  •  心在旅途
    2020-12-21 07:16

    I think @rpanai's answer on the original post is still the best. Here are my tests:

    import numpy as np
    import pandas as pd

    def qualifies(dt, excluded_months=()):
        # Reject the first four days of the month.
        if dt.day < 5:
            return False
        # Reject dates fewer than five days before the next month starts.
        if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
            return False
        if dt.month in excluded_months:
            return False
        return True
    
    def new_qualifies(dt, excluded_months=()):
        # Same checks, but coerce the input to a pandas Timestamp first.
        dt = pd.Timestamp(dt)
        if dt.day < 5:
            return False
        if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
            return False
        if dt.month in excluded_months:
            return False
        return True
    
    df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})
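
The only difference between the two functions above is the `pd.Timestamp(dt)` line. A minimal sketch of why that conversion can matter, assuming the function is handed a raw `numpy.datetime64` scalar rather than a `pandas.Timestamp` (the original question is not reproduced here, so this is an assumption about the failure mode; the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(['2020-01-15']))

from_pandas = s.iloc[0]        # pandas.Timestamp: exposes .day, .month, ...
from_numpy = s.to_numpy()[0]   # numpy.datetime64: the same instant, but a NumPy scalar

# NumPy datetime scalars do not provide the .day attribute that qualifies() relies on.
numpy_has_day = hasattr(from_numpy, 'day')

# Wrapping the NumPy scalar restores the pandas API, as new_qualifies() does.
restored = pd.Timestamp(from_numpy)
print(type(from_pandas).__name__, type(from_numpy).__name__, numpy_has_day, restored.day)
```

Because `pd.Timestamp(...)` also accepts an existing `Timestamp`, the conversion is harmless when the input already has the right type.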
    

    apply method:

    %%timeit
    df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
    

    385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    conversion method:

    %%timeit
    df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))
    

    389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    vectorized code:

    %%timeit
    df['qualifies2'] = np.logical_not(
        (df['date'].dt.day < 5).values
        | ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values
        | (df['date'].dt.month.isin([3, 8])).values
    )
    

    4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
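
The timings above only compare speed. As a sanity check that the vectorized expression computes the same thing as the row-by-row version, the sketch below compares the two on a smaller frame. The 200-row size and the names `looped`, `vectorized` and `days_to_month_start` are arbitrary choices for illustration, and `~` on the combined boolean Series stands in for `np.logical_not(...)` on the `.values` arrays.

```python
import numpy as np
import pandas as pd

def qualifies(dt, excluded_months=()):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=200)})
date = df['date']

# Row-by-row reference result.
looped = date.apply(lambda x: qualifies(x, [3, 8]))

# Column-wise equivalent: a date qualifies only if none of the three
# rejection conditions holds.
days_to_month_start = (date + pd.tseries.offsets.MonthBegin(1) - date).dt.days
vectorized = ~(
    (date.dt.day < 5)
    | (days_to_month_start < 5)
    | date.dt.month.isin([3, 8])
)

assert np.array_equal(looped.to_numpy(), vectorized.to_numpy())
```

If the assertion passes, the speed-up comes purely from replacing per-row Python calls with column operations, not from a change in logic.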
