Question
An obligatory assurance that I have read the many posts on the topic before posting this. I'm aware that multiprocessing entails a fixed cost, but to the best of my knowledge this doesn't seem to be the problem here.
I basically have a number of separate optimisation problems, and want to solve them in parallel. The following code is a simple example:
import psutil
import multiprocessing as mp
import time
from scipy.optimize import minimize
import numpy as np
pset = np.random.uniform(-10,10,500)
def func(x, p):
    out = (x - p)**2
    return out

def object(p):
    def func2(x):
        return func(x, p)
    output = minimize(func2,
                      x0,
                      method='trust-constr')
    xstar = output.x
    return xstar
# 1. Loop
tic = time.perf_counter()
out_list = []
x0 = 0
for p in pset:
    xstar = object(p)
    out_list.append(xstar)
#print(np.vstack(out_list))
toc = time.perf_counter()
print(f'Loop done in {toc - tic:0.4f} seconds')
# 2. Pool
n_cpu = psutil.cpu_count(logical=False)

if __name__ == '__main__':
    pool = mp.Pool(n_cpu)
    #results = pool.map_async(object, pset).get()
    results = pool.map(object, pset)
    pool.close()
    pool.join()
    #print(np.vstack(results))
    toc2 = time.perf_counter()
    print(f'Pool done in {toc2 - toc:0.4f} seconds')
As you can see, the 'pool' method takes longer, and increasingly so the more problems there are to solve (hence my conjecture that this isn't a fixed-cost issue). My actual optimisation problem is a lot more complicated than this: while the loop takes a few minutes to solve, say, 3 problems, 'pool' keeps running for a long, long time, at least 15 minutes, before I decide to force-terminate it.
What could be the reason for this inferior performance? Is there a problem with using parallel computing for optimisation problems? What other tricks could I try to speed things up?
Answer 1:
Q : "What could be the reason for this inferior performance?"
Welcome to the real world of how computing devices actually work.
Your code commits a few performance sins, so let's dig them up, OK?
The code makes 500 .map()-ed calls of the object() function, and pays the immense upfront cost of spawning n copies of the whole Python session into replicated processes (so as to escape the monopolistic GIL-lock re-[SERIAL]-isation that would appear otherwise; read other great posts on this subject if you are not aware of the GIL details). Yet the actual work delegated to these expensive "remote" processes is just to run a .minimize() method, driven by the square of a difference. In other words, the full memory image of the main Python interpreter process, all data structures included, gets replicated .cpu_count() times: memory-I/O allocated (thus swapping, if you headbang into any of the physical-RAM-size or O/S memory-manager ceilings) and then copied. Yes, that is what happens on the Windows O/S; the costs are somewhat less devastating on Linux O/S-es.
So, quite an expensive product of calling just one innocent pool.map() SLOC, isn't it?
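To get a feel for how much of that cost is pure process-spawning plus task-dispatch overhead, you can time the same 500-item .map() over a worker that does nothing at all; any time measured is overhead, not computing. A minimal sketch, not from the original post (the noop name and the pool size of 4 are illustrative assumptions):

import multiprocessing as mp
import time

def noop(p):
    # does no real work, so any elapsed time is pure overhead
    return p

if __name__ == '__main__':
    tic = time.perf_counter()
    with mp.Pool(4) as pool:           # 4 workers, an arbitrary example
        pool.map(noop, range(500))     # 500 dispatches, zero useful work
    toc = time.perf_counter()
    print(f'spawn + dispatch overhead: {toc - tic:0.4f} seconds')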
Next come the SER/DES plus communication costs of passing the parameters "there" and the results "back". Here this bears a few kB for the payloads on the way "there" and a few B for the results on the way "back", so you happily do not sense much of this kind of pain (it may hurt your code badly in other, less happy use-cases), yet you still do pay it, again as an additional add-on overhead time, on each of the 500 .map()-ed calls.
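If you want to put a number on the SER/DES leg alone, you can time the pickle round-trip that multiprocessing performs under the hood for every parameter sent "there" and every result sent "back". A small sketch (the scalar payload stands in for one element of pset):

import pickle
import time

payload = 3.141592653589793            # one scalar parameter, like one element of pset
tic = time.perf_counter()
for _ in range(500):
    blob = pickle.dumps(payload)       # SER: the way "there"
    _ = pickle.loads(blob)             # DES: the way "back"
toc = time.perf_counter()
print(f'500 SER/DES round-trips: {toc - tic:0.6f} seconds, payload {len(blob)} B')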
Next comes the worse part. Having requested .cpu_count()-many processes to stand in the O/S process-scheduler queue, each waiting for its turn to grab a CPU-core and execute (for a scheduler-granted time-slice, before being forcibly moved out of the CPU-core so that some other O/S-assigned process can move in and execute; this is how process-scheduling works), you might already feel the smoke: this comes at yet another add-on overhead cost, consumed by the (heavy) lifting once many processes stand waiting in the queue for their respective turns.
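One standard way to pay these queueing and dispatch costs less often is to send fewer, larger tasks: Pool.map() accepts a chunksize parameter that packs many items into a single inter-process message, so each worker makes one trip through the queue per batch instead of per item. A hedged sketch (solve_one is an illustrative stand-in for the object() function above):

import multiprocessing as mp
import numpy as np

def solve_one(p):
    # stand-in for the real per-problem solver
    return p

if __name__ == '__main__':
    pset = np.random.uniform(-10, 10, 500)
    with mp.Pool(4) as pool:
        # ~50 problems travel in each inter-process message, amortising
        # the dispatch + SER/DES overhead over a whole batch
        results = pool.map(solve_one, pset, chunksize=50)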
Next comes the worst penalty: you easily, almost as if guaranteed, start to pay ~300x higher memory-access costs. Upon each scheduled re-entry to a CPU-core, your process has lost the chance to reuse the fastest computing resource, the on-core L1_data cache. Any such fast-to-be-reused data has been overwritten by whatever process used that CPU-core while yours was waiting for its next CPU-core share, leaving behind only LRU-cached data you never need to reuse. So your process pays extra to re-fetch its data again, at no less than ~100-380 ns per fetch from main RAM, and that only if the memory-I/O channels permit it to be served without further waiting for a free channel. This will most probably happen on every O/S-process-scheduler move-out/move-in cycle, since in most cases you do not even land on the same CPU-core you camped on last time (so there is not even a chance to speculate about residuals that might "remain" in the L1_data on-core cache from the "last round" your process was assigned to some of the available CPU-cores).
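On platforms where psutil exposes affinity control (Linux and Windows; not macOS), you can at least stop the cross-core migration by pinning each worker to a fixed CPU-core, which removes one of the L1-cache-eviction mechanisms described above. A sketch under those assumptions (pin_to_core is an illustrative helper name, not an existing API):

import multiprocessing as mp
import os
import psutil

def pin_to_core():
    # pin this worker process to one physical core, derived from its PID;
    # psutil.Process().cpu_affinity() exists on Linux/Windows builds only
    core = os.getpid() % psutil.cpu_count(logical=False)
    psutil.Process().cpu_affinity([core])

if __name__ == '__main__':
    with mp.Pool(4, initializer=pin_to_core) as pool:
        print(pool.map(abs, range(-8, 0)))   # trivial demo workload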
Last but not least, contemporary CPUs perform another level of process-to-core "optimisation" strategy, aimed at reducing the risk of hitting the thermal ceiling. Processes are moved "around" even more often, so that they work on colder CPU-cores while the "hot" cores cool down. That strategy works fine for light workloads, where a few computing-intensive processes may enjoy the thermally-motivated jumping from a hot CPU-core to another, colder one; if left on the hot core instead, the hot silicon fabric would lower that core's frequency so as not to exceed the thermal ceiling, and, yes, you get it, at the reduced CPU frequency you get fewer CPU-clocks and less computing done until the hot CPU-core gets colder (a sort of oxymoron for heavy-computing jobs, isn't it?). For a few processes on a multicore CPU, such a thermal strategy may seem attractive, showing high-GHz clocking and thermal jumps from hot to colder CPU-cores, but, for obvious reasons, it stops working once you .map()-distribute processes so as to "cover" all CPU-cores (not to mention all the other O/S-orchestrated processes in the queue). The only result is that the all-hot CPU-cores lower their frequency and work in slow-motion, so as to withstand the thermal-ceiling limitations.
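You can even watch this clock-throttling from Python: psutil reports the current per-core frequency on most platforms. A small observation sketch (readings and per-CPU support vary by O/S and hardware; run it while your pool is busy):

import time
import psutil

for _ in range(5):
    freqs = psutil.cpu_freq(percpu=True)   # may collapse to one entry on some O/S-es
    print([round(f.current) for f in freqs], 'MHz')
    time.sleep(1.0)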
Weird?
No. This is how contemporary silicon toys work. Speculative strategies work nicely, but only for a few, rather light-weight, workloads. Beyond that, you start suffering from the reality of all the constraints that the laws of physics impose (until then hidden behind over-hyped marketing slogans, valid only for a vast excess of (cold) resources and light-weight computing / memory-I/O traffic patterns).
More reading on this, plus an analysis of the modern criticism of the original (overhead-naive) Amdahl's Law argument, is here.
So, welcome to the reality of computing :o)
Source: https://stackoverflow.com/questions/64636480/python-using-multiprocessing-is-much-slower-than-loop-for-optimisation-problem