Pandas mask / where methods versus NumPy np.where

误落风尘 2020-12-29 20:23

I often use Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to numpy.where.
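
For concreteness, here is a minimal sketch of the two equivalent ways of writing such an update (df being a DataFrame with a single column of random floats, as in the listings in the answer below):

    import numpy as np
    import pandas as pd

    n = 10_000_000
    df = pd.DataFrame(np.random.random(n))

    # numpy: pick 2*x where the condition holds, x otherwise
    result_np = np.where(df[0] > 0.5, df[0]*2, df[0])

    # pandas: same logic via Series.mask (Series.where is the complement)
    result_pd = df[0].mask(df[0] > 0.5, df[0]*2)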

1 Answer
  • 2020-12-29 20:45

    I'm using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.

    But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:

    twice = df[0]*2
    mask = df[0] > 0.5
    %timeit np.where(mask, twice, df[0])  
    # 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit df[0].mask(mask, twice)
    # 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Numpy's version is about 2.3 times faster than pandas.

    So let's profile both functions to see the difference - profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.
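
    As an aside: Python's own cProfile only sees Python-level frames, so it would attribute essentially all of the time to the single np.where or Series.mask call and never show the native functions inside numpy or pandas where the time is actually spent - that is why a native profiler (perf) is used below. A quick sketch of the pure-Python view, for comparison:

    import cProfile
    # df, mask and twice defined as above; only the top-level Python call shows up,
    # not the native functions it spends its time in
    cProfile.run("np.where(mask, twice, df[0])", sort="cumtime")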

    I'm on Linux and use perf. For the numpy version we get (see Appendix A for the listing):

    >>> perf record python np_where.py
    >>> perf report
    
    Overhead  Command  Shared Object                                Symbol                              
      68,50%  python   multiarray.cpython-36m-x86_64-linux-gnu.so   [.] PyArray_Where
       8,96%  python   [unknown]                                    [k] 0xffffffff8140290c
       1,57%  python   mtrand.cpython-36m-x86_64-linux-gnu.so       [.] rk_random
    

    As we can see, the lion's share of the time is spent in PyArray_Where - about 69%. The unknown symbol is a kernel function (as a matter of fact, clear_page) - I ran without root privileges, so the symbol is not resolved.

    And for pandas we get (see Appendix B for code):

    >>> perf record python pd_mask.py
    >>> perf report
    
    Overhead  Command  Shared Object                                Symbol                                                                                               
      37,12%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
      23,36%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
      19,78%  python   [unknown]                                    [k] 0xffffffff8140290c
       3,32%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
       1,48%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not
    

    Quite a different situation:

    • pandas doesn't use PyArray_Where under the hood - the most prominent time consumer is vm_engine_iter_task, which is numexpr functionality (a quick way to test this is sketched right after this list).
    • there is some heavy memory copying going on - __memmove_ssse3_back alone accounts for about 23% of the time! Probably some of the kernel functions are also connected to memory accesses.
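
    To test the numexpr claim, one can switch numexpr off and re-time the call. This is only a sketch - it assumes the compute.use_numexpr option is available in this pandas version:

    pd.set_option("compute.use_numexpr", False)   # fall back to plain numpy evaluation
    %timeit df[0].mask(mask, twice)
    pd.set_option("compute.use_numexpr", True)    # restore the default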

    Actually, pandas 0.19 used PyArray_Where under the hood; for that older version the perf report looks like this:

    Overhead  Command        Shared Object                     Symbol                                                                                                     
      32,42%  python         multiarray.so                     [.] PyArray_Where
      30,25%  python         libc-2.23.so                      [.] __memmove_ssse3_back
      21,31%  python         [kernel.kallsyms]                 [k] clear_page
       1,72%  python         [kernel.kallsyms]                 [k] __schedule
    

    So back then pandas basically used np.where under the hood, plus some overhead (above all data copying, see __memmove_ssse3_back).

    I see no scenario in which pandas version 0.19 could be faster than numpy here - it just adds overhead to numpy's functionality. Pandas version 0.23.3 is an entirely different story: it uses the numexpr module, so it is quite possible that there are scenarios for which the pandas version is (at least slightly) faster.
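
    To get a feeling for that potential, one can call numexpr directly (a sketch, not part of the original measurements; numexpr evaluates the expression multi-threaded and without creating large intermediate arrays):

    import numexpr as ne
    # numexpr's expression language has its own where(cond, then, else)
    ne.evaluate("where(m, t, d)",
                local_dict={"m": mask.values, "t": twice.values, "d": df[0].values})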

    I'm not sure this memory copying is really necessary - one could almost call it a performance bug - but I just don't know enough to be certain.
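
    A rough way to look at this from Python (a sketch, not part of the profiling above; it needs the memory_profiler package) is to compare the peak memory of the two calls:

    %load_ext memory_profiler
    %memit np.where(mask, twice, df[0])   # peak memory of the numpy version
    %memit df[0].mask(mask, twice)        # peak memory of the pandas version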

    We can help pandas avoid some of the copying by peeling away indirection, i.e. passing np.array instead of pd.Series. For example:

    %timeit df[0].mask(mask.values, twice.values)
    # 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Now pandas is only about 25% slower. perf says:

    Overhead  Command  Shared Object                                Symbol                                                                                                
      50,81%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
      14,12%  python   [unknown]                                    [k] 0xffffffff8140290c
       9,93%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
       4,61%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
       2,01%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not
    

    Much less data copying, but still more than in the numpy version, and this is mostly responsible for the remaining overhead.

    My key take-aways from it:

    • pandas has the potential to be at least slightly faster than numpy (because it can use numexpr under the hood). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) copying.

    • when the performance of where/mask is the bottleneck, I would use numba/cython to improve it - see my rather naive attempts with numba and cython further below.


    The idea is to take the

    np.where(df[0] > 0.5, df[0]*2, df[0])
    

    version and to eliminate the need to create a temporary array - i.e., df[0]*2.

    As proposed by @max9111, using numba:

    import numba as nb
    import numpy as np

    @nb.njit
    def nb_where(df):
        # element-wise equivalent of np.where(df > 0.5, 2*df, df),
        # but without materializing the temporary array 2*df
        n = len(df)
        output = np.empty(n, dtype=np.float64)
        for i in range(n):
            if df[i] > 0.5:
                output[i] = 2.0*df[i]
            else:
                output[i] = df[i]
        return output
    
    assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()
    %timeit np.where(df[0] > 0.5, df[0]*2, df[0])
    # 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit nb_where(df[0].values)
    # 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    That is about a factor of 5 faster than the numpy version!
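
    For what it's worth, the loop is embarrassingly parallel, so a natural next step (untested here, hence only a sketch) would be numba's parallel mode:

    # reuses the nb/np imports from the block above
    @nb.njit(parallel=True)
    def nb_where_par(arr):
        n = len(arr)
        output = np.empty(n, dtype=np.float64)
        # nb.prange lets numba distribute the iterations over several threads
        for i in nb.prange(n):
            if arr[i] > 0.5:
                output[i] = 2.0*arr[i]
            else:
                output[i] = arr[i]
        return output

    nb_where_par(df[0].values)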

    And here is my far less successful attempt to improve the performance with the help of Cython:

    %%cython -a
    cimport numpy as np
    import numpy as np
    cimport cython
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def cy_where(double[::1] df):
        cdef int i
        cdef int n = len(df)
        cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
        for i in range(n):
            if df[i]>0.5:
                output[i] = 2.0*df[i]
            else:
                output[i] = df[i]
        return output
    
    assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()
    
    %timeit cy_where(df[0].values)
    # 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    which gives a 25% speed-up. I'm not sure why Cython is so much slower than numba, though.
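
    One thing worth checking first (just a guess, I haven't verified it) is whether the extension was built with full optimization; the cell magic accepts extra compiler flags, e.g.:

    %%cython -a --compile-args=-O3 --compile-args=-march=native
    # ... same cy_where as above, just rebuilt with more aggressive compiler flags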


    Listings:

    A: np_where.py:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    n = 10000000
    df = pd.DataFrame(np.random.random(n))
    
    twice = df[0]*2
    for _ in range(50):
          np.where(df[0] > 0.5, twice, df[0])  
    

    B: pd_mask.py:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    n = 10000000
    df = pd.DataFrame(np.random.random(n))
    
    twice = df[0]*2
    mask = df[0] > 0.5
    for _ in range(50):
          df[0].mask(mask, twice)
    