Fill up a dictionary in parallel with multiprocessing


You are incurring a good amount of overhead from (1) starting up each process, and (2) copying the pandas.DataFrame (etc.) across several processes. If you just need a dict filled in parallel, I'd suggest a shared dict (via a Manager). If no key will be overwritten, then it's easy and you don't have to worry about locks.

(Note: I'm using multiprocess below, which is a fork of multiprocessing -- but only so I can demonstrate from the interpreter; otherwise, you'd have to run the code below from __main__.)

>>> from multiprocess import Process, Manager
>>> 
>>> def f(d, x):
...   d[x] = x**2
... 
>>> manager = Manager()
>>> d = manager.dict()
>>> jobs = [Process(target=f, args=(d, i)) for i in range(5)]
>>> _ = [p.start() for p in jobs]
>>> _ = [p.join() for p in jobs]
>>> print(d)
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
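
For completeness, here is a minimal sketch of the same approach as a regular script using the standard library's multiprocessing, which needs the __main__ guard mentioned above:

from multiprocessing import Process, Manager

def f(d, x):
    d[x] = x**2

if __name__ == '__main__':
    manager = Manager()
    d = manager.dict()  # proxy dict, shared across processes
    jobs = [Process(target=f, args=(d, i)) for i in range(5)]
    _ = [p.start() for p in jobs]
    _ = [p.join() for p in jobs]
    print(dict(d))  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}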

This solution doesn't make copies of the dict to share across processes, so that part of the overhead is reduced. For large objects like a pandas.DataFrame, the copying cost can be significant compared to that of a simple operation like x**2. Similarly, spawning a Process takes time, and you may be able to do the above even faster (for lightweight work) by using threads (i.e. importing from multiprocess.dummy instead of multiprocess, for either your originally posted solution or mine above).
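For example, a minimal sketch of the threaded variant, shown with the standard library's multiprocessing.dummy (which multiprocess.dummy mirrors) -- threads share memory, so a plain dict works, though the GIL limits the benefit to I/O-bound or very lightweight work:

from multiprocessing.dummy import Process  # thread-backed 'Process'

def f(d, x):
    d[x] = x**2

d = {}  # ordinary dict -- all threads see the same object
jobs = [Process(target=f, args=(d, i)) for i in range(5)]
_ = [p.start() for p in jobs]
_ = [p.join() for p in jobs]
print(d)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}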

If you do need to share DataFrames (as your code suggests, rather than what the question asks), you might be able to do it by creating a shared-memory numpy.ndarray.
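A minimal sketch of that idea, assuming the DataFrame's numeric values can be copied once into a multiprocessing.RawArray that each worker then wraps as an ndarray (the shape, dtype, and per-row work here are illustrative assumptions, not a definitive recipe):

import ctypes
import multiprocessing as mp
import numpy as np

def double_row(i, buf, shape):
    # re-wrap the shared buffer as an ndarray -- no copy is made
    arr = np.frombuffer(buf, dtype=np.float64).reshape(shape)
    arr[i] *= 2  # each worker writes only its own row, so no lock is needed

if __name__ == '__main__':
    shape = (4, 3)
    buf = mp.RawArray(ctypes.c_double, shape[0] * shape[1])  # shared, lock-free
    arr = np.frombuffer(buf, dtype=np.float64).reshape(shape)
    arr[:] = np.arange(12, dtype=np.float64).reshape(shape)  # e.g. df.values
    jobs = [mp.Process(target=double_row, args=(i, buf, shape))
            for i in range(shape[0])]
    _ = [p.start() for p in jobs]
    _ = [p.join() for p in jobs]
    print(arr)  # every row doubled, without pickling the whole array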
