So, I am playing around with multiprocessing.Pool and Numpy, but it seems I missed some important point. Why is the pool version much
I also noticed that when I ran numpy matrix multiplication inside of a Pool.map() function, it ran much slower on certain machines. My goal was to parallelize my work using Pool.map(), and run a process on each core of my machine. When things were running fast, the numpy matrix multiplication was only a small part of the overall work performed in parallel. When I looked at the CPU usage of the processes, I could see that each process could use e.g. 400+% CPU on the machines where it ran slow, but always <=100% on the machines where it ran fast. For me, the solution was to stop numpy from multithreading. It turns out that numpy was set up to multithread on exactly the machines where my Pool.map() was running slow. Evidently, if you are already parallelizing using Pool.map(), then having numpy also parallelize just creates interference. I just called export MKL_NUM_THREADS=1 before running my Python code and it worked fast everywhere.