Pandas mask / where methods versus NumPy np.where

误落风尘 2020-12-29 20:23

I often use Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to numpy.where.
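
For concreteness, here is a minimal sketch of the two equivalent ways of writing such an update (df being a DataFrame with a single column of random floats, as in the listings in the answer below):

    import numpy as np
    import pandas as pd

    n = 10_000_000
    df = pd.DataFrame(np.random.random(n))

    # numpy: pick 2*x where the condition holds, x otherwise
    result_np = np.where(df[0] > 0.5, df[0]*2, df[0])

    # pandas: same logic via Series.mask (Series.where is the complement)
    result_pd = df[0].mask(df[0] > 0.5, df[0]*2)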

1 Answer
  • 2020-12-29 20:45

    I'm using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.

    But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:

    twice = df[0]*2
    mask = df[0] > 0.5
    %timeit np.where(mask, twice, df[0])  
    # 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit df[0].mask(mask, twice)
    # 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Numpy's version is about 2.3 times faster than pandas.

    So let's profile both functions to see the difference - profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.
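
    As an aside: Python's own cProfile only sees Python-level frames, so it would attribute essentially all of the time to the single np.where or Series.mask call and never show the native functions inside numpy or pandas where the time is actually spent - that is why a native profiler (perf) is used below. A quick sketch of the pure-Python view, for comparison:

    import cProfile
    # df, mask and twice defined as above; only the top-level Python call shows up,
    # not the native functions it spends its time in
    cProfile.run("np.where(mask, twice, df[0])", sort="cumtime")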

    I'm on Linux and use perf. For the numpy version we get (see Appendix A for the listing):

    >>> perf record python np_where.py
    >>> perf report
    
    Overhead  Command  Shared Object                                Symbol                              
      68,50%  python   multiarray.cpython-36m-x86_64-linux-gnu.so   [.] PyArray_Where
       8,96%  python   [unknown]                                    [k] 0xffffffff8140290c
       1,57%  python   mtrand.cpython-36m-x86_64-linux-gnu.so       [.] rk_random
    

    As we can see, the lion's share of the time is spent in PyArray_Where - about 69%. The unknown symbol is a kernel function (as a matter of fact, clear_page) - I ran without root privileges, so the symbol is not resolved.

    And for pandas we get (see Appendix B for code):

    >>> perf record python pd_mask.py
    >>> perf report
    
    Overhead  Command  Shared Object                                Symbol                                                                                               
      37,12%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
      23,36%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
      19,78%  python   [unknown]                                    [k] 0xffffffff8140290c
       3,32%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
       1,48%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not
    

    Quite a different situation:

    • pandas doesn't use PyArray_Where under the hood - the most prominent time consumer is vm_engine_iter_task, which is numexpr functionality (a quick way to test this is sketched right after this list).
    • there is some heavy memory copying going on - __memmove_ssse3_back alone accounts for about 23% of the time! Probably some of the kernel functions are also connected to memory accesses.
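
    To test the numexpr claim, one can switch numexpr off and re-time the call. This is only a sketch - it assumes the compute.use_numexpr option is available in this pandas version:

    pd.set_option("compute.use_numexpr", False)   # fall back to plain numpy evaluation
    %timeit df[0].mask(mask, twice)
    pd.set_option("compute.use_numexpr", True)    # restore the default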

    Actually, pandas 0.19 used PyArray_Where under the hood; for that older version the perf report looks like this:

    Overhead  Command        Shared Object                     Symbol                                                                                                     
      32,42%  python         multiarray.so                     [.] PyArray_Where
      30,25%  python         libc-2.23.so                      [.] __memmove_ssse3_back
      21,31%  python         [kernel.kallsyms]                 [k] clear_page
       1,72%  python         [kernel.kallsyms]                 [k] __schedule
    

    So back then pandas basically used np.where under the hood, plus some overhead (above all data copying, see __memmove_ssse3_back).

    I see no scenario in which pandas version 0.19 could be faster than numpy here - it just adds overhead to numpy's functionality. Pandas version 0.23.3 is an entirely different story: it uses the numexpr module, so it is quite possible that there are scenarios for which the pandas version is (at least slightly) faster.
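
    To get a feeling for that potential, one can call numexpr directly (a sketch, not part of the original measurements; numexpr evaluates the expression multi-threaded and without creating large intermediate arrays):

    import numexpr as ne
    # numexpr's expression language has its own where(cond, then, else)
    ne.evaluate("where(m, t, d)",
                local_dict={"m": mask.values, "t": twice.values, "d": df[0].values})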

    I'm not sure this memory copying is really necessary - one could almost call it a performance bug - but I just don't know enough to be certain.
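
    A rough way to look at this from Python (a sketch, not part of the profiling above; it needs the memory_profiler package) is to compare the peak memory of the two calls:

    %load_ext memory_profiler
    %memit np.where(mask, twice, df[0])   # peak memory of the numpy version
    %memit df[0].mask(mask, twice)        # peak memory of the pandas version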

    We can help pandas avoid some of the copying by peeling away indirection, i.e. passing np.array instead of pd.Series. For example:

    %timeit df[0].mask(mask.values, twice.values)
    # 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Now pandas is only about 25% slower. perf says:

    Overhead  Command  Shared Object                                Symbol                                                                                                
      50,81%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
      14,12%  python   [unknown]                                    [k] 0xffffffff8140290c
       9,93%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
       4,61%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
       2,01%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not
    

    Much less data copying, but still more than in the numpy version, and this is mostly responsible for the remaining overhead.

    My key take-aways from it:

    • pandas has the potential to be at least slightly faster than numpy (because it can use numexpr under the hood). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) copying.

    • when the performance of where/mask is the bottleneck, I would use numba/cython to improve it - see my rather naive attempts with numba and cython further below.


    The idea is to take the

    np.where(df[0] > 0.5, df[0]*2, df[0])
    

    version and to eliminate the need to create a temporary array - i.e., df[0]*2.

    As proposed by @max9111, using numba:

    import numba as nb
    import numpy as np

    @nb.njit
    def nb_where(df):
        # element-wise equivalent of np.where(df > 0.5, 2*df, df),
        # but without materializing the temporary array 2*df
        n = len(df)
        output = np.empty(n, dtype=np.float64)
        for i in range(n):
            if df[i] > 0.5:
                output[i] = 2.0*df[i]
            else:
                output[i] = df[i]
        return output
    
    assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()
    %timeit np.where(df[0] > 0.5, df[0]*2, df[0])
    # 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit nb_where(df[0].values)
    # 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    That is about a factor of 5 faster than the numpy version!
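
    For what it's worth, the loop is embarrassingly parallel, so a natural next step (untested here, hence only a sketch) would be numba's parallel mode:

    # reuses the nb/np imports from the block above
    @nb.njit(parallel=True)
    def nb_where_par(arr):
        n = len(arr)
        output = np.empty(n, dtype=np.float64)
        # nb.prange lets numba distribute the iterations over several threads
        for i in nb.prange(n):
            if arr[i] > 0.5:
                output[i] = 2.0*arr[i]
            else:
                output[i] = arr[i]
        return output

    nb_where_par(df[0].values)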

    And here is my far less successful attempt to improve the performance with the help of Cython:

    %%cython -a
    cimport numpy as np
    import numpy as np
    cimport cython
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def cy_where(double[::1] df):
        cdef int i
        cdef int n = len(df)
        cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
        for i in range(n):
            if df[i]>0.5:
                output[i] = 2.0*df[i]
            else:
                output[i] = df[i]
        return output
    
    assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()
    
    %timeit cy_where(df[0].values)
    # 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    which gives a 25% speed-up. I'm not sure why Cython is so much slower than numba, though.
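
    One thing worth checking first (just a guess, I haven't verified it) is whether the extension was built with full optimization; the cell magic accepts extra compiler flags, e.g.:

    %%cython -a --compile-args=-O3 --compile-args=-march=native
    # ... same cy_where as above, just rebuilt with more aggressive compiler flags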


    Listings:

    A: np_where.py:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    n = 10000000
    df = pd.DataFrame(np.random.random(n))
    
    twice = df[0]*2
    for _ in range(50):
          np.where(df[0] > 0.5, twice, df[0])  
    

    B: pd_mask.py:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    n = 10000000
    df = pd.DataFrame(np.random.random(n))
    
    twice = df[0]*2
    mask = df[0] > 0.5
    for _ in range(50):
          df[0].mask(mask, twice)
    