Pandas df.iterrows() parallelization

长情又很酷 · 2020-12-02 10:02

I would like to parallelize the following code:

for row in df.iterrows():
    idx = row[0]
    k = row[1]['Chromosome']
    start, end = row[1]['Bin'].split('-')  # '-' delimiter assumed

3 Answers
  •  粉色の甜心
    2020-12-02 10:39

    A faster way (about 10% in my case):

    Main difference from the accepted answer: use pd.concat and np.array_split to split and join the dataframe.

    import multiprocessing
    import numpy as np
    import pandas as pd


    def parallelize_dataframe(df, func):
        num_cores = multiprocessing.cpu_count() - 1  # leave one core free so the machine stays responsive
        num_partitions = num_cores  # number of partitions to split the dataframe into
        df_split = np.array_split(df, num_partitions)
        pool = multiprocessing.Pool(num_cores)
        df = pd.concat(pool.map(func, df_split))
        pool.close()
        pool.join()
        return df


    where func is the function you want to apply to df. Use functools.partial(func, arg=arg_val) to pass more than one argument.
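    A minimal usage sketch of this pattern (the `double_values` helper, its `factor` argument, and the `value` column are hypothetical, chosen only to illustrate passing an extra argument via `partial`):

    ```python
    import multiprocessing
    from functools import partial

    import numpy as np
    import pandas as pd


    def parallelize_dataframe(df, func):
        # leave one core free; max(1, ...) guards against single-core machines
        num_cores = max(1, multiprocessing.cpu_count() - 1)
        df_split = np.array_split(df, num_cores)
        with multiprocessing.Pool(num_cores) as pool:
            df = pd.concat(pool.map(func, df_split))
        return df


    def double_values(chunk, factor=2):
        # hypothetical per-chunk function: scale one column by `factor`
        chunk = chunk.copy()
        chunk["value"] = chunk["value"] * factor
        return chunk


    if __name__ == "__main__":
        df = pd.DataFrame({"value": range(8)})
        result = parallelize_dataframe(df, partial(double_values, factor=3))
        print(result["value"].tolist())  # [0, 3, 6, 9, 12, 15, 18, 21]
    ```

    Note that the chunk function must be defined at module level so the worker processes can pickle it; a lambda or nested function would fail inside `pool.map`.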
