Why is np.where faster than pd.apply

时光说笑 2020-12-01 21:50

Sample code is here

import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Sp


        
2 answers
  • 2020-12-01 22:39

    Just adding a visualization to what has already been said.

    Profile and total cumulative time of df.apply:

    We can see that the cumulative time is 13.8 s.

    Profile and total cumulative time of np.where:

    Here, the cumulative time is 5.44 ms, which makes np.where roughly 2,500 times faster than df.apply (13.8 s / 5.44 ms ≈ 2,537).

    The figures above were obtained using the SnakeViz library.

    SnakeViz displays profiles as a sunburst in which functions are represented as arcs. A root function is a circle at the middle, with functions it calls around, then the functions those functions call, and so on. The amount of time spent inside a function is represented by the angular width of the arc. An arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a skinny arc represents a function that is using hardly any time at all.
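For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using the standard-library cProfile module. The question's DataFrame is truncated above, so the `Spend` column, the 500 threshold, and the helper names (`label_with_apply`, `label_with_where`, `cumulative_time`) are assumptions made purely for illustration, and the timings will differ from the 13.8 s and 5.44 ms reported here.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

# Hypothetical data: the column name and size are assumptions, not the asker's frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Spend": rng.integers(0, 1000, size=10_000)})


def label_with_apply(frame):
    # The lambda is called once per row, in Python.
    return frame.apply(lambda row: "High" if row["Spend"] > 500 else "Low", axis=1)


def label_with_where(frame):
    # A single vectorized comparison over the whole column.
    return np.where(frame["Spend"] > 500, "High", "Low")


def cumulative_time(func, frame, dump_path=None):
    """Profile func(frame) and return the total time recorded by cProfile."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(frame)
    profiler.disable()
    if dump_path is not None:
        # Writes a stats file that SnakeViz can open.
        profiler.dump_stats(dump_path)
    return pstats.Stats(profiler).total_tt


apply_seconds = cumulative_time(label_with_apply, df)
where_seconds = cumulative_time(label_with_where, df)
print(f"df.apply: {apply_seconds:.4f} s, np.where: {where_seconds:.4f} s")
```

Passing something like `dump_path="apply.prof"` and then running `snakeviz apply.prof` from a shell should open the sunburst view described above.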

  • 2020-12-01 22:40

    I think np.where is faster because it works on NumPy arrays in a vectorized way, and pandas is built on these arrays.

    df.apply is slow because it uses loops: the function is called in Python for each row (or column).

    Vectorized operations are the fastest, then Cython routines, and then apply.

    See this answer for a better explanation from Jeff, a pandas developer.
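To make the "apply uses loops" point concrete, here is a small sketch that counts how often the Python function runs and then times both approaches. The `Customer` values come from the question, but the `Spend` column, the 500 threshold, and the `classify` helper are assumptions, since the original sample code is cut off.

```python
import timeit

import numpy as np
import pandas as pd

# 'Spend' and its values are hypothetical; only the customer names are from the question.
df = pd.DataFrame({
    "Customer": ["Bob", "Ken", "Steve", "Joe"] * 2_500,
    "Spend": np.arange(10_000) % 1_000,
})

calls = 0


def classify(row):
    # Runs as ordinary Python code for each row that apply passes in.
    global calls
    calls += 1
    return "High" if row["Spend"] > 500 else "Low"


via_apply = df.apply(classify, axis=1)
calls_in_one_pass = calls

# One comparison and one selection over whole arrays, with no Python-level loop.
via_where = np.where(df["Spend"] > 500, "High", "Low")

apply_time = timeit.timeit(lambda: df.apply(classify, axis=1), number=3)
where_time = timeit.timeit(lambda: np.where(df["Spend"] > 500, "High", "Low"), number=3)

print(f"classify was called {calls_in_one_pass} times for {len(df)} rows")
print(f"df.apply: {apply_time:.4f} s, np.where: {where_time:.4f} s (3 runs each)")
```

Both approaches produce the same labels. The difference is that `classify` is invoked at least once per row, whereas `np.where` hands the whole column to compiled NumPy code in a single call.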
