Make Pandas DataFrame apply() use all cores?

陌清茗 2020-11-27 09:52

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its com

6 Answers
  •  Happy的楠姐
    2020-11-27 10:28

    The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):

    import pandas as pd
    import dask.dataframe as dd
    from dask.multiprocessing import get  # legacy scheduler handle; newer Dask releases select it via compute(scheduler='processes')
    

    and the syntax is

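    data = ...  # your pandas DataFrame
    ddata = dd.from_pandas(data, npartitions=30)
    
    # One parameter per column of the DataFrame; fill in your own computation
    def myfunc(x, y, z): return ...
    
    # Run a plain pandas apply on each partition, in separate processes.
    # scheduler='processes' replaces the get=get keyword, which newer Dask no longer accepts.
    res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(scheduler='processes')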
    

    (I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):

    import timeit
    import numpy as np
    
    data = pd.DataFrame()
    data['col1'] = np.random.normal(size=1500000)
    data['col2'] = np.random.normal(size=1500000)
    
    ddata = dd.from_pandas(data, npartitions=30)
    def myfunc(x, y): return y*(x**2+1)
    # Row-wise apply; used both on the full frame and on each Dask partition
    def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
    def pandas_apply(): return apply_myfunc_to_DF(data)
    # scheduler='processes' replaces the get=get keyword, which newer Dask no longer accepts
    def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(scheduler='processes')
    def vectorized(): return myfunc(data['col1'], data['col2'])
    
    t_pds = timeit.Timer(lambda: pandas_apply())
    print(t_pds.timeit(number=1))
    

    28.16970546543598

    t_dsk = timeit.Timer(lambda: dask_apply())
    print(t_dsk.timeit(number=1))
    

    2.708152851089835

    t_vec = timeit.Timer(lambda: vectorized())
    print(t_vec.timeit(number=1))
    

    0.010668013244867325

    That is roughly a factor of 10 speedup going from pandas apply (about 28.2 s) to Dask apply on partitions (about 2.7 s). Of course, if you have a function you can vectorize, you should: in this case the function (y*(x**2+1)) is trivially vectorized, and the vectorized call finishes in about 0.01 s, far faster than either apply variant. But there are plenty of things that are impossible to vectorize.
