I'm seeing unwanted behaviour from np.vectorize: it changes the datatype of the argument passed to the original function. My original question
I think @rpanai's answer on the original post is still the best. Here are my tests:
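import numpy as np
import pandas as pd

def qualifies(dt, excluded_months=()):
    # Reject dates in the first four days of the month.
    if dt.day < 5:
        return False
    # Reject dates fewer than five days before the start of the next month.
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    # Reject dates that fall in an excluded month.
    if dt.month in excluded_months:
        return False
    return True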
def new_qualifies(dt, excluded_months=()):
    # Convert first, so the checks also work when dt arrives as a numpy.datetime64.
    dt = pd.Timestamp(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True
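The conversion at the top of new_qualifies matters because a numpy.datetime64 scalar has no .day or .month attribute, while a pd.Timestamp does. Here is a minimal sketch of that difference, assuming the function is handed a numpy.datetime64 as the question describes (the names raw and ts are only illustrative):

```python
import numpy as np
import pandas as pd

raw = np.datetime64("2020-01-15")  # NumPy scalar, as assumed from the question
ts = pd.Timestamp(raw)             # same conversion as in new_qualifies

has_day_raw = hasattr(raw, "day")  # numpy.datetime64 exposes no .day attribute
has_day_ts = hasattr(ts, "day")    # pd.Timestamp does

# Converting a value that is already a Timestamp leaves it unchanged,
# so the extra call is safe on the plain .apply path too.
same = pd.Timestamp(ts) == ts
```

Since the conversion is a no-op for values that are already Timestamps, it should add little overhead, which fits the near-identical timings of the two apply runs.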
df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})
apply method:
%%timeit
df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
conversion method:
%%timeit
df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))
389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vectorized code:
%%timeit
df['qualifies2'] = np.logical_not(
    (df['date'].dt.day < 5).values
    | ((df['date'] + pd.tseries.offsets.MonthBegin(1) - df['date']).dt.days < 5).values
    | (df['date'].dt.month.isin([3, 8])).values
)
4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
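Before trusting the speed-up, it is worth checking that the vectorized expression gives the same answers as the row-wise function. Here is a rough sketch of such a check on a small sample. It repeats qualifies from above so it runs on its own, and the names sample, row_wise, days_left, vectorized and match are only illustrative:

```python
import numpy as np
import pandas as pd

def qualifies(dt, excluded_months=()):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

# Smaller frame than in the timings, just to compare results.
sample = pd.DataFrame({"date": pd.date_range("2020-01-01", freq="7D", periods=200)})

# Row-wise result via .apply, as in the "apply method" test.
row_wise = sample["date"].apply(lambda x: qualifies(x, [3, 8]))

# Column-wise result using the same three conditions.
dates = sample["date"]
days_left = (dates + pd.tseries.offsets.MonthBegin(1) - dates).dt.days
vectorized = ~((dates.dt.day < 5) | (days_left < 5) | dates.dt.month.isin([3, 8]))

# True when both approaches agree on every row.
match = bool((row_wise == vectorized).all())
```

If match is True on a representative sample, the vectorized version can replace the apply call.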