I am performing some large computations on 3 different numpy 2D arrays sequentially. The arrays are huge, 25000x25000 each. Each computation takes significant time, so I decided to run them in parallel.
Here is an example using np.memmap and Pool. Note that you can choose the number of tasks and the number of worker processes independently. This approach does not give you explicit control over the task queue; if you need that, use multiprocessing.Queue directly (a sketch of that variant follows the example below):
from multiprocessing import Pool
import numpy as np

def mysum(array_file_name, col1, col2, shape):
    # Re-open the memory-mapped file in the worker and operate on a column slice
    a = np.memmap(array_file_name, dtype='float64', shape=shape, mode='r+')
    a[:, col1:col2] = np.random.random((shape[0], col2 - col1))
    ans = a[:, col1:col2].sum()
    del a  # flush changes and release the memmap
    return ans

if __name__ == '__main__':
    nop = 1000  # number of tasks (column chunks)
    now = 3     # number of worker processes
    p = Pool(now)
    array_file_name = 'test.array'
    shape = (25000, 25000)

    # Create the backing file once in the parent process
    a = np.memmap(array_file_name, dtype='float64', shape=shape, mode='w+')
    del a

    # Split the columns into nop chunks
    cols = [(shape[1] * i // nop, shape[1] * (i + 1) // nop) for i in range(nop)]

    results = []
    for c1, c2 in cols:
        r = p.apply_async(mysum, args=(array_file_name, c1, c2, shape))
        results.append(r)
    p.close()
    p.join()

    final_result = sum(r.get() for r in results)
    print(final_result)
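For completeness, here is a minimal sketch of the queue-based variant using multiprocessing.Process and multiprocessing.Queue. The worker function name and the None-sentinel convention are my own choices for illustration, not something required by numpy or multiprocessing:

from multiprocessing import Process, Queue
import numpy as np

def worker(array_file_name, shape, tasks, results):
    # Pull (col1, col2) tasks until the sentinel None arrives
    while True:
        task = tasks.get()
        if task is None:
            break
        col1, col2 = task
        a = np.memmap(array_file_name, dtype='float64', shape=shape, mode='r+')
        a[:, col1:col2] = np.random.random((shape[0], col2 - col1))
        results.put(a[:, col1:col2].sum())
        del a

if __name__ == '__main__':
    array_file_name = 'test.array'
    shape = (25000, 25000)
    nop, now = 1000, 3  # number of tasks, number of workers

    # Create the backing file once in the parent process
    a = np.memmap(array_file_name, dtype='float64', shape=shape, mode='w+')
    del a

    tasks, results = Queue(), Queue()
    workers = [Process(target=worker,
                       args=(array_file_name, shape, tasks, results))
               for _ in range(now)]
    for w in workers:
        w.start()

    for i in range(nop):
        tasks.put((shape[1] * i // nop, shape[1] * (i + 1) // nop))
    for _ in range(now):
        tasks.put(None)  # one sentinel per worker

    final_result = sum(results.get() for _ in range(nop))
    for w in workers:
        w.join()
    print(final_result)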
You can achieve better performance using shared-memory parallel processing, when possible. See this related question:
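As a rough illustration of the shared-memory approach (assuming Python 3.8+ and multiprocessing.shared_memory; the shm_sum helper and the SHAPE/DTYPE constants are my own, and the parent-side fill is only there to give the workers something to sum):

from multiprocessing import Pool, shared_memory
import numpy as np

SHAPE = (25000, 25000)
DTYPE = np.float64

def shm_sum(shm_name, col1, col2):
    # Attach to the existing shared block and view it as a numpy array
    shm = shared_memory.SharedMemory(name=shm_name)
    a = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    ans = a[:, col1:col2].sum()
    del a        # drop the view before closing the buffer
    shm.close()  # detach only; the parent owns (and unlinks) the block
    return ans

if __name__ == '__main__':
    nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
    block = shared_memory.SharedMemory(create=True, size=nbytes)
    a = np.ndarray(SHAPE, dtype=DTYPE, buffer=block.buf)
    a[:] = np.random.random(SHAPE)  # fill once in the parent

    nop, now = 1000, 3
    cols = [(SHAPE[1] * i // nop, SHAPE[1] * (i + 1) // nop) for i in range(nop)]

    with Pool(now) as p:
        results = [p.apply_async(shm_sum, args=(block.name, c1, c2))
                   for c1, c2 in cols]
        final_result = sum(r.get() for r in results)

    print(final_result)
    del a
    block.close()
    block.unlink()  # free the shared memory segment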