Question
An obligatory assurance that I have read the many posts on the topic before posting this. I'm aware that multiprocessing entails a fixed cost, but to the best of my knowledge this doesn't seem to be the problem here.
I basically have a number of separate optimisation problems, and want to solve them in parallel. The following code is a simple example:
import psutil
import multiprocessing as mp
import time
from scipy.optimize import minimize
import numpy as np
pset = np.random.uniform(-10,10,500)
def func(x, p):
    out = (x - p)**2
    return out

def object(p):
    def func2(x):
        return func(x, p)
    output = minimize(func2,
                      x0,
                      method='trust-constr')
    xstar = output.x
    return xstar
# 1. Loop
tic = time.perf_counter()
out_list = []
x0 = 0
for p in pset:
    xstar = object(p)
    out_list.append(xstar)
#print(np.vstack(out_list))
toc = time.perf_counter()
print(f'Loop done in {toc - tic:0.4f} seconds')
# 2. Pool
n_cpu = psutil.cpu_count(logical=False)

if __name__ == '__main__':
    pool = mp.Pool(n_cpu)
    #results = pool.map_async(object, pset).get()
    results = pool.map(object, pset)
    pool.close()
    pool.join()
    #print(np.vstack(results))
    toc2 = time.perf_counter()
    print(f'Pool done in {toc2 - toc:0.4f} seconds')
As you can see, the 'pool' method takes longer, and increasingly so the more problems there are to solve (hence my conjecture that this isn't a fixed-cost issue). My actual optimisation problem is a lot more complicated than this: while the loop takes a few minutes to solve, say, 3 problems, 'pool' keeps running for a long, long time, at least 15 minutes, before I decide to force-terminate it.
What could be the reason for this inferior performance? Is there a problem with using parallel computing for optimisation problems? What other tricks could I try to speed things up?
Answer 1:
Q : "What could be the reason for this inferior performance?"
Welcome to the real world of how computing devices actually work.
Your code commits a few performance sins, so let's dig them up, OK?
The code makes 500 .map()-ed calls of the object() function, and pays the immense upfront cost of spawning n copies of the whole Python session into replicated processes (so as to escape the monopolistic GIL-lock re-[SERIAL]-isation that would appear otherwise; read other great posts on this subject if you are not aware of the GIL details). Yet the actual work delegated to these expensive "remote" processes is just to run a .minimize() method, driven by the square of a difference. In other words, the full memory image of the main Python interpreter process, all data structures included, gets replicated .cpu_count() times: memory-I/O allocated (thus swapping, if you headbang into any of the physical-RAM-size or O/S memory-manager ceilings) and then copied. Yes, that is what happens on the Windows O/S; the costs are somewhat less devastating on Linux O/S-es.
So, quite an expensive product of calling just one innocent pool.map() SLOC, isn't it?
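To get a feel for how much of that cost is pure process-spawning plus task-dispatch overhead, you can time the same 500-item .map() over a worker that does nothing at all; any time measured is overhead, not computing. A minimal sketch, not from the original post (the noop name and the pool size of 4 are illustrative assumptions):

import multiprocessing as mp
import time

def noop(p):
    # does no real work, so any elapsed time is pure overhead
    return p

if __name__ == '__main__':
    tic = time.perf_counter()
    with mp.Pool(4) as pool:           # 4 workers, an arbitrary example
        pool.map(noop, range(500))     # 500 dispatches, zero useful work
    toc = time.perf_counter()
    print(f'spawn + dispatch overhead: {toc - tic:0.4f} seconds')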
Next come the SER/DES plus communication costs of passing the parameters "there" and the results "back". Here this bears a few kB for the payloads on the way "there" and a few B for the results on the way "back", so you happily do not sense much of this kind of pain (it may hurt your code badly in other, less happy use-cases), yet you still do pay it, again as an additional add-on overhead time, on each of the 500 .map()-ed calls.
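If you want to put a number on the SER/DES leg alone, you can time the pickle round-trip that multiprocessing performs under the hood for every parameter sent "there" and every result sent "back". A small sketch (the scalar payload stands in for one element of pset):

import pickle
import time

payload = 3.141592653589793            # one scalar parameter, like one element of pset
tic = time.perf_counter()
for _ in range(500):
    blob = pickle.dumps(payload)       # SER: the way "there"
    _ = pickle.loads(blob)             # DES: the way "back"
toc = time.perf_counter()
print(f'500 SER/DES round-trips: {toc - tic:0.6f} seconds, payload {len(blob)} B')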
Next comes the worse part. Having requested .cpu_count()-many processes to stand in the O/S process-scheduler queue, each waiting for its turn to grab a CPU-core and execute (for a scheduler-granted time-slice, before being forcibly moved out of the CPU-core so that some other O/S-assigned process can move in and execute; this is how process-scheduling works), you might already feel the smoke: this comes at yet another add-on overhead cost, consumed by the (heavy) lifting once many processes stand waiting in the queue for their respective turns.
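One standard way to pay these queueing and dispatch costs less often is to send fewer, larger tasks: Pool.map() accepts a chunksize parameter that packs many items into a single inter-process message, so each worker makes one trip through the queue per batch instead of per item. A hedged sketch (solve_one is an illustrative stand-in for the object() function above):

import multiprocessing as mp
import numpy as np

def solve_one(p):
    # stand-in for the real per-problem solver
    return p

if __name__ == '__main__':
    pset = np.random.uniform(-10, 10, 500)
    with mp.Pool(4) as pool:
        # ~50 problems travel in each inter-process message, amortising
        # the dispatch + SER/DES overhead over a whole batch
        results = pool.map(solve_one, pset, chunksize=50)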
Next comes the worst penalty: you easily, almost as if guaranteed, start to pay ~300x higher memory-access costs. Upon each scheduled re-entry to a CPU-core, your process has lost the chance to reuse the fastest computing resource, the on-core L1_data cache. Any such fast-to-be-reused data has been overwritten by whatever process used that CPU-core while yours was waiting for its next CPU-core share, leaving behind only LRU-cached data you never need to reuse. So your process pays extra to re-fetch its data again, at no less than ~100-380 ns per fetch from main RAM, and that only if the memory-I/O channels permit it to be served without further waiting for a free channel. This will most probably happen on every O/S-process-scheduler move-out/move-in cycle, since in most cases you do not even land on the same CPU-core you camped on last time (so there is not even a chance to speculate about residuals that might "remain" in the L1_data on-core cache from the "last round" your process was assigned to some of the available CPU-cores).
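On platforms where psutil exposes affinity control (Linux and Windows; not macOS), you can at least stop the cross-core migration by pinning each worker to a fixed CPU-core, which removes one of the L1-cache-eviction mechanisms described above. A sketch under those assumptions (pin_to_core is an illustrative helper name, not an existing API):

import multiprocessing as mp
import os
import psutil

def pin_to_core():
    # pin this worker process to one physical core, derived from its PID;
    # psutil.Process().cpu_affinity() exists on Linux/Windows builds only
    core = os.getpid() % psutil.cpu_count(logical=False)
    psutil.Process().cpu_affinity([core])

if __name__ == '__main__':
    with mp.Pool(4, initializer=pin_to_core) as pool:
        print(pool.map(abs, range(-8, 0)))   # trivial demo workload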
Last but not least, contemporary CPUs perform another level of process-to-core "optimisation" strategy, aimed at reducing the risk of hitting the thermal ceiling. Processes are moved "around" even more often, so that they work on colder CPU-cores while the "hot" cores cool down. That strategy works fine for light workloads, where a few computing-intensive processes may enjoy the thermally-motivated jumping from a hot CPU-core to another, colder one; if left on the hot core instead, the hot silicon fabric would lower that core's frequency so as not to exceed the thermal ceiling, and, yes, you get it, at the reduced CPU frequency you get fewer CPU-clocks and less computing done until the hot CPU-core gets colder (a sort of oxymoron for heavy-computing jobs, isn't it?). For a few processes on a multicore CPU, such a thermal strategy may seem attractive, showing high-GHz clocking and thermal jumps from hot to colder CPU-cores, but, for obvious reasons, it stops working once you .map()-distribute processes so as to "cover" all CPU-cores (not to mention all the other O/S-orchestrated processes in the queue). The only result is that the all-hot CPU-cores lower their frequency and work in slow-motion, so as to withstand the thermal-ceiling limitations.
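You can even watch this clock-throttling from Python: psutil reports the current per-core frequency on most platforms. A small observation sketch (readings and per-CPU support vary by O/S and hardware; run it while your pool is busy):

import time
import psutil

for _ in range(5):
    freqs = psutil.cpu_freq(percpu=True)   # may collapse to one entry on some O/S-es
    print([round(f.current) for f in freqs], 'MHz')
    time.sleep(1.0)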
Weird?
No. This is how contemporary silicon toys work. Speculative strategies work nicely, but only for a few, rather light-weight, workloads. Beyond that, you start suffering from the reality of all the constraints that the laws of physics impose (until then hidden behind over-hyped marketing slogans, valid only for a vast excess of (cold) resources and light-weight computing / memory-I/O traffic patterns).
More reading on this, plus an analysis of the modern criticism of the original (overhead-naive) Amdahl's Law argument, is here.
So, welcome to the reality of computing :o)
Source: https://stackoverflow.com/questions/64636480/python-using-multiprocessing-is-much-slower-than-loop-for-optimisation-problem