I often use Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant slowdown compared to numpy.where.
I'm using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.
But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:
twice = df[0]*2
mask = df[0] > 0.5
%timeit np.where(mask, twice, df[0])
# 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[0].mask(mask, twice)
# 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy's version is about 2.3 times faster than pandas.
So let's profile both functions to see the difference - profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.
I'm on Linux and use perf. For the numpy version we get (for the listing see Appendix A):
>>> perf record python np_where.py
>>> perf report
Overhead Command Shared Object Symbol
68,50% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] PyArray_Where
8,96% python [unknown] [k] 0xffffffff8140290c
1,57% python mtrand.cpython-36m-x86_64-linux-gnu.so [.] rk_random
As we can see, the lion's share of the time is spent in PyArray_Where - about 69%. The unknown symbol is a kernel function (as a matter of fact clear_page) - I ran without root privileges, so the symbol is not resolved.
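As an aside: if resolving the kernel symbols matters for the analysis, one option (assuming sufficient privileges - not something I did for the numbers above) is to relax the relevant sysctls, or simply to run perf as root:
>>> sudo sysctl -w kernel.perf_event_paranoid=-1
>>> sudo sysctl -w kernel.kptr_restrict=0
>>> perf record python np_where.py
>>> perf report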
And for pandas we get (see Appendix B for code):
>>> perf record python pd_mask.py
>>> perf report
Overhead Command Shared Object Symbol
37,12% python interpreter.cpython-36m-x86_64-linux-gnu.so [.] vm_engine_iter_task
23,36% python libc-2.23.so [.] __memmove_ssse3_back
19,78% python [unknown] [k] 0xffffffff8140290c
3,32% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_isnan
1,48% python umath.cpython-36m-x86_64-linux-gnu.so [.] BOOL_logical_not
Quite a different situation: pandas doesn't use PyArray_Where under the hood here - the most prominent time-consumer is vm_engine_iter_task, which is numexpr functionality. There is also heavy memory copying going on - __memmove_ssse3_back uses about 25% of the time! Probably some of the kernel's functions are connected to memory accesses as well.

Actually, pandas 0.19 used PyArray_Where under the hood; for the older version the perf report would look like:
Overhead Command Shared Object Symbol
32,42% python multiarray.so [.] PyArray_Where
30,25% python libc-2.23.so [.] __memmove_ssse3_back
21,31% python [kernel.kallsyms] [k] clear_page
1,72% python [kernel.kallsyms] [k] __schedule
So basically it would use np.where under the hood plus some overhead (above all data copying, see __memmove_ssse3_back) back then.
I see no scenario where pandas could become faster than numpy in pandas' version 0.19 - it just adds overhead to numpy's functionality. Pandas' version 0.23.3 is an entirely different story - here the numexpr module is used, and it is very possible that there are scenarios for which pandas' version is (at least slightly) faster.
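To get a feeling for the numexpr machinery, here is a minimal sketch that calls numexpr directly - my own illustration of the where-expression idea, not the exact code path pandas takes internally:
import numexpr as ne
import numpy as np

x = np.random.random(10000000)
m = x > 0.5
t = 2 * x

# the expression string is compiled and evaluated blockwise in numexpr's own
# virtual machine (the vm_engine_iter_task seen in the profile above)
result = ne.evaluate("where(m, t, x)")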
I'm not sure this memory copying is really called for/necessary - maybe one could even call it a performance bug, but I just don't know enough to be certain.
We could help pandas not to copy by peeling away some indirections (passing np.array instead of pd.Series). For example:
%timeit df[0].mask(mask.values > 0.5, twice.values)
# 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now pandas is only about 25% slower. perf says:
Overhead Command Shared Object Symbol
50,81% python interpreter.cpython-36m-x86_64-linux-gnu.so [.] vm_engine_iter_task
14,12% python [unknown] [k] 0xffffffff8140290c
9,93% python libc-2.23.so [.] __memmove_ssse3_back
4,61% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_isnan
2,01% python umath.cpython-36m-x86_64-linux-gnu.so [.] BOOL_logical_not
Much less data copying, but still more than in the numpy version - and this copying is mostly responsible for the remaining overhead.
My key takeaways from this:

pandas has the potential to be at least slightly faster than numpy (because, with numexpr under the hood, being faster is possible). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) data copying.
when the performance of where/mask is the bottleneck, I would use numba/cython to improve the performance - see my rather naive tries to use numba and cython further below.
The idea is to take the np.where(df[0] > 0.5, df[0]*2, df[0]) version and to eliminate the need to create a temporary - i.e., df[0]*2.
As proposed by @max9111, using numba:
import numba as nb
import numpy as np

@nb.njit
def nb_where(df):
    # elementwise: 2*x where x > 0.5, else x - no full-size temporary for 2*df is built
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0 * df[i]
        else:
            output[i] = df[i]
    return output
assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()
%timeit np.where(df[0] > 0.5, df[0]*2, df[0])
# 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nb_where(df[0].values)
# 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is about a factor of 5 faster than the numpy version!
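If the result is needed back in a pandas workflow, wrapping the raw numpy output into a Series is cheap compared to the computation itself - a straightforward way to do it (my own addition, not part of the benchmark above):
# wrap the numba result back into a Series, reusing the original index
result = pd.Series(nb_where(df[0].values), index=df.index)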
And here is my far less successful attempt to improve the performance with the help of Cython:
%%cython -a
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0 * df[i]
        else:
            output[i] = df[i]
    return output
assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()
%timeit cy_where(df[0].values)
# 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This gives about a 25% speed-up over the numpy version. I'm not sure why Cython is so much slower than numba here, though.
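One guess (untested, purely an assumption on my part): numba compiles with fairly aggressive optimizations by default, whereas the default compiler flags used by the %%cython magic may be less aggressive. One could try recompiling the same function with extra flags and re-timing it:
%%cython -a --compile-args=-O3 --compile-args=-march=native
# identical to cy_where above, only recompiled with more aggressive flags -
# whether this actually closes the gap to numba is not verified here
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where_o3(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0 * df[i]
        else:
            output[i] = df[i]
    return output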
Listings:
A: np_where.py:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
for _ in range(50):
    np.where(df[0] > 0.5, twice, df[0])
B: pd_mask.py:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
mask = df[0] > 0.5
for _ in range(50):
    df[0].mask(mask, twice)