I would like to parallelize the following code:
for row in df.iterrows():
    idx = row[0]
    k = row[1]['Chromosome']
    start, end = row[1]['Bin'].split('-')  # '-' delimiter assumed, e.g. a "start-end" string
A faster way (about 10% in my case):

Main differences to the accepted answer: use pd.concat and np.array_split to split and join the dataframe.
import multiprocessing

import numpy as np
import pandas as pd

def parallelize_dataframe(df, func):
    num_cores = multiprocessing.cpu_count() - 1  # leave one core free so the machine does not freeze
    num_partitions = num_cores                   # number of partitions to split the dataframe into
    df_split = np.array_split(df, num_partitions)
    pool = multiprocessing.Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
where func is the function you want to apply to df.
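For example, a minimal sketch of how it can be called (process_chunk and the value column are made up for illustration; note that the worker has to be a module-level function so it can be pickled, and on Windows the call must sit under an if __name__ == '__main__': guard):

def process_chunk(chunk):
    # runs in a worker process on one slice of the dataframe
    chunk = chunk.copy()
    chunk['squared'] = chunk['value'] ** 2
    return chunk

if __name__ == '__main__':
    df = pd.DataFrame({'value': range(100000)})
    result = parallelize_dataframe(df, process_chunk)
    print(result.head())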
Use partial(func, arg=arg_val) if func takes more than one argument.
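A sketch along the same lines (scale_chunk and factor are hypothetical; pool.map only passes one argument to the worker, so partial binds the rest beforehand):

from functools import partial

def scale_chunk(chunk, factor):
    # hypothetical worker that takes an extra keyword argument
    chunk = chunk.copy()
    chunk['value'] = chunk['value'] * factor
    return chunk

if __name__ == '__main__':
    df = pd.DataFrame({'value': range(100000)})
    result = parallelize_dataframe(df, partial(scale_chunk, factor=10))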