Python's multiprocessing: speed up a for-loop for several sets of parameters, “apply” vs. “apply_async”


Question


I would like to integrate a system of differential equations for many different parameter combinations and store the final values of the variables that belong to each parameter set. Therefore, I implemented a simple for-loop in which random initial conditions and parameter combinations are created, the system is integrated, and the values of interest are stored in the respective arrays. Since I intend to do this for many parameter combinations of a rather complex system (here I only use a toy system for illustration), which can also become stiff, I would like to parallelize the simulations using Python's "multiprocessing" module to speed up the process.

However, when I run the simulations, the for-loop is always faster than its parallelized version. The only way I have found so far to beat the for-loop is to use "apply_async" instead of "apply". For 10 different parameter combinations, I get, for example, the following output (using the code below):

The for loop took  0.11986207962 seconds!
[ 41.75971761  48.06034375  38.74134139  25.6022232   46.48436046
  46.34952734  50.9073202   48.26035086  50.05026187  41.79483135]
Using apply took  0.180637836456 seconds!
41.7597176061
48.0603437545
38.7413413879
25.6022231983
46.4843604574
46.3495273394
50.9073202011
48.2603508573
50.0502618731
41.7948313502
Using apply_async took  0.000414133071899 seconds!
41.7597176061
48.0603437545
38.7413413879
25.6022231983
46.4843604574
46.3495273394
50.9073202011
48.2603508573
50.0502618731
41.7948313502

Although in this example the order of the results is identical for "apply" and "apply_async", this does not seem to hold in general. So I would like to use "apply_async" since it is much faster, but in that case I don't know how to match the outcome of each simulation to the parameters/initial conditions that were used for it.

My questions are therefore:

1) Why is "apply" so much slower than the simple for-loop in this case?

2) When I use "apply_async" instead of "apply", the parallelized version becomes much faster than the for-loop; but how can I then match the outcome of each simulation to the parameters I used for it?

3) In this case, the results of “apply” and “apply_async” have the same order. Why is that? Coincidence?

My code can be found below:

from pylab import *
import multiprocessing as mp
from scipy.integrate import odeint
import time

#my system of differential equations
def myODE(yn, tvec, allpara):

    (x, y, z) = yn

    a, b = allpara['para']

    dx  = -x + a*y + x*x*y
    dy = b - a*y - x*x*y
    dz = x*y

    return (dx, dy, dz) 

#for reproducibility    
seed(0) 

#time settings for integration
dt = 0.01
tmax = 50
tval = arange(0,tmax,dt)

numVar = 3 #number of variables (x, y, z)
numPar = 2 #number of parameters (a, b)
numComb = 10 #number of parameter combinations

INIT = zeros((numComb,numVar)) #initial conditions will be stored here
PARA = zeros((numComb,numPar)) #parameter combinations for a and b will be stored here
RES = zeros(numComb) #z(tmax) will be stored here

tic = time.time()

for combi in range(numComb):

    INIT[combi,:] = append(10*rand(2),0) #initial conditions for x and y are randomly chosen, z is 0

    PARA[combi,:] = 10*rand(2) #parameter a and b are chosen randomly

    allpara = {'para': PARA[combi,:]}

    results = transpose(odeint(myODE, INIT[combi,:], tval, args=(allpara,))) #integrate system

    RES[combi] = results[numVar - 1][-1] #store z

    #INIT[combi,:] = results[:,-1] #update initial conditions
    #INIT[combi,-1] = 0 #set z to 0

toc = time.time()

print 'The for loop took ', toc-tic, 'seconds!'

print RES

#function for the multi-processing part
def runMyODE(yn,tvec,allpara):

    return transpose(odeint(myODE, yn, tvec, args=(allpara,)))

tic = time.time()

pool = mp.Pool(processes=4)
results = [pool.apply(runMyODE, args=(INIT[combi,:],tval,{'para': PARA[combi,:]})) for combi in range(numComb)]

toc = time.time()

print 'Using apply took ', toc-tic, 'seconds!'

for sol in range(numComb):
    print results[sol][2,-1] #print final value of z

tic = time.time()    
resultsAsync = [pool.apply_async(runMyODE, args=(INIT[combi,:],tval,{'para': PARA[combi,:]})) for combi in range(numComb)]    
toc = time.time()
print 'Using apply_async took ', toc-tic, 'seconds!'

for sol in range(numComb):
    print resultsAsync[sol].get()[2,-1] #print final value of z

Answer 1:


Note that the fact that your apply_async is 289 times faster than the for loop is a little suspicious! And right now, you're guaranteed to get the results in the order they're submitted, even if that isn't what you want for maximum parallelism.

apply_async only starts a task; it does not wait for the task to complete, whereas .get() does. (pool.apply, in contrast, blocks until each call has finished, so the tasks run one after the other anyway, with the added cost of shipping data between processes on top; that is why apply is slower than the plain for-loop.) So this:

tic = time.time()    
resultsAsync = [pool.apply_async(runMyODE, args=(INIT[combi,:],tval,{'para': PARA[combi,:]})) for combi in range(numComb)]    
toc = time.time()

isn't really a fair measurement: you've merely submitted all the tasks, but they aren't necessarily completed yet.
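A fairer way to time the asynchronous version is to submit every task and then block on each .get() before stopping the clock; only then have the workers actually finished. A minimal sketch with a trivial stand-in task (square is hypothetical, not part of the original code):

```python
import time
import multiprocessing as mp

def square(x):
    # trivial stand-in for the ODE integration
    return x * x

pool = mp.Pool(processes=4)

tic = time.time()
handles = [pool.apply_async(square, args=(i,)) for i in range(10)]
# .get() blocks until the corresponding task has completed,
# so the elapsed time now includes the actual work.
results = [h.get() for h in handles]
toc = time.time()

pool.close()
pool.join()

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(toc - tic, 'seconds')
```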

On the other hand, once you .get() the results, you know that the task has completed and that you have the answer; so doing this

for sol in range(numComb):
    print resultsAsync[sol].get()[2,-1] #print final value of z

means you are guaranteed to get the results in order (because you walk through the ApplyResult objects in order and .get() each one); but you might want to have each result as soon as it is ready rather than doing a blocking wait on them one at a time. That, however, means you'd need to label the results with their parameters one way or another.
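One callback-free way to do that labelling is to have the worker return an identifying index (or the parameters themselves) together with its value; results that complete out of order can then be matched up or re-sorted afterwards. A minimal sketch, again with a hypothetical stand-in task rather than the ODE solver:

```python
import multiprocessing as mp

def tagged_double(args):
    i, x = args
    # return the submission index alongside the result
    return i, 2 * x

pool = mp.Pool(processes=4)
# imap_unordered yields results as soon as they are ready,
# in whatever order the workers finish...
tagged = list(pool.imap_unordered(tagged_double, enumerate([5, 7, 11])))
pool.close()
pool.join()

# ...and the attached index lets us restore the submission order.
tagged.sort(key=lambda pair: pair[0])
values = [v for i, v in tagged]
print(values)  # [10, 14, 22]
```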

You can use callbacks to save the results once the tasks are done, and return the parameters along with the results, to allow completely asynchronous returns:

def runMyODE(yn,tvec,allpara):
    return allpara['para'],transpose(odeint(myODE, yn, tvec, args=(allpara,)))

asyncResults = []

def saveResult(result):
    asyncResults.append((result[0], result[1][2,-1]))

tic = time.time()
for combi in range(numComb):
    pool.apply_async(runMyODE, args=(INIT[combi,:],tval,{'para': PARA[combi,:]}), callback=saveResult)
pool.close()
pool.join()
toc = time.time()

print 'Using apply_async took ', toc-tic, 'seconds!'

for res in asyncResults:
    print res[0], res[1]

Gives you a more reasonable time; the results are still almost always in order because the tasks take very similar amounts of time:

Using apply took  0.0847041606903 seconds!
[ 6.02763376  5.44883183] 41.7597176061
[ 4.37587211  8.91773001] 48.0603437545
[ 7.91725038  5.2889492 ] 38.7413413879
[ 0.71036058  0.871293  ] 25.6022231983
[ 7.78156751  8.70012148] 46.4843604574
[ 4.61479362  7.80529176] 46.3495273394
[ 1.43353287  9.44668917] 50.9073202011
[ 2.64555612  7.74233689] 48.2603508573
[ 0.187898    6.17635497] 50.0502618731
[ 9.43748079  6.81820299] 41.7948313502
Using apply_async took  0.0259671211243 seconds!
[ 4.37587211  8.91773001] 48.0603437545
[ 0.71036058  0.871293  ] 25.6022231983
[ 6.02763376  5.44883183] 41.7597176061
[ 7.91725038  5.2889492 ] 38.7413413879
[ 7.78156751  8.70012148] 46.4843604574
[ 4.61479362  7.80529176] 46.3495273394
[ 1.43353287  9.44668917] 50.9073202011
[ 2.64555612  7.74233689] 48.2603508573
[ 0.187898    6.17635497] 50.0502618731
[ 9.43748079  6.81820299] 41.7948313502

Note that rather than looping over apply_async yourself, you could also use map_async. Be aware, though, that the pool has to pickle the callable it is given, so it must be a module-level function: a lambda such as lambda combi: runMyODE(INIT[combi,:], tval, {'para': PARA[combi,:]}) will fail with a PicklingError. Also, map_async's callback receives the entire list of results in a single call, rather than one result at a time as with apply_async.
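A minimal, runnable sketch of the map_async variant, using a hypothetical stand-in task (cube) instead of the ODE solver; the task function is defined at module level so the pool can pickle it:

```python
import multiprocessing as mp

def cube(x):
    # module-level function: picklable, unlike a lambda
    return x ** 3

collected = []

def saveAll(resultList):
    # map_async's callback receives the entire list of results at once,
    # already in submission order
    collected.extend(resultList)

pool = mp.Pool(processes=4)
pool.map_async(cube, range(5), callback=saveAll)
pool.close()
pool.join()  # wait for the workers (and the result handler) to finish

print(collected)  # [0, 1, 8, 27, 64]
```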


Source: https://stackoverflow.com/questions/30282688/pythons-multiprocessing-speed-up-a-for-loop-for-several-sets-of-parameters-a
