I am fairly new to Python. I am using the multiprocessing module to read lines of text on stdin, convert them in some way, and write them into a database. Here's a
apply_async returns an AsyncResult object, which you can wait on:
if len(batch) >= 10000:
    r = pool.apply_async(insert, args=(batch, i+1))
    r.wait()
    batch = []
Though if you want to do this in a cleaner manner, you should use a multiprocessing.Queue with a maxsize of 10000 and derive a Worker class from multiprocessing.Process that fetches from such a queue, as sketched below.
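A minimal sketch of that Worker approach, assuming a hypothetical handle_batch function and arbitrary worker count and batch data (none of which come from the question), might look like this:

import multiprocessing

def handle_batch(batch):
    pass  # placeholder for the real convert-and-insert logic

class Worker(multiprocessing.Process):
    """Pulls batches from a bounded queue until it receives the None sentinel."""

    def __init__(self, queue):
        super(Worker, self).__init__()
        self.queue = queue

    def run(self):
        while True:
            batch = self.queue.get()
            if batch is None:        # sentinel: no more work for this worker
                break
            handle_batch(batch)

if __name__ == '__main__':
    queue = multiprocessing.Queue(maxsize=10000)   # put() blocks once 10000 batches are queued
    workers = [Worker(queue) for _ in range(4)]
    for w in workers:
        w.start()

    for batch in ([1, 2, 3], [4, 5, 6]):           # stand-in batches
        queue.put(batch)

    for w in workers:                              # one sentinel per worker
        queue.put(None)
    for w in workers:
        w.join()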
The apply_async and map_async functions are designed not to block the main process. In order to do so, the Pool maintains an internal Queue whose size is unfortunately impossible to change.
The problem can be solved by using a Semaphore initialized with the size you want the queue to be: acquire the semaphore before feeding a task to the pool and release it once a worker has completed the task.
Here's an example working with Python 2.6 or greater.
from threading import Semaphore
from multiprocessing import Pool

def task_wrapper(f):
    """Python 2 does not allow exceptions raised in a task to reach the
    callback, so this wrapper ensures the code run in the worker is
    exception free.
    """
    try:
        return f()
    except Exception:
        return None

class TaskManager(object):
    def __init__(self, processes, queue_size):
        self.pool = Pool(processes=processes)
        self.workers = Semaphore(processes + queue_size)

    def new_task(self, f):
        """Start a new task; blocks if the queue is full."""
        self.workers.acquire()
        self.pool.apply_async(task_wrapper, args=(f,), callback=self.task_done)

    def task_done(self, result):
        """Called once a task is done; releases the semaphore so a blocked
        new_task call can proceed."""
        self.workers.release()
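Assuming insert is a module-level function (so it pickles cleanly), the manager could then be driven roughly like this; the insert stub and the stand-in batches below are placeholders, not part of the original answer:

import functools

def insert(batch, batch_number):
    pass  # placeholder for the real database insert

if __name__ == '__main__':
    manager = TaskManager(processes=4, queue_size=10)
    for i, batch in enumerate([['row'], ['row'], ['row']]):   # stand-in batches
        # blocks once processes + queue_size tasks are in flight
        manager.new_task(functools.partial(insert, batch, i + 1))
    manager.pool.close()
    manager.pool.join()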
Another example uses the concurrent.futures pool implementation.
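A rough sketch of the same idea on top of concurrent.futures could bound the number of pending tasks with a semaphore released from each future's done-callback; the work function, max_pending value, and worker count here are illustrative assumptions:

from concurrent.futures import ProcessPoolExecutor
from threading import BoundedSemaphore

def work(item):
    return item * 2                     # placeholder for the real task

if __name__ == '__main__':
    max_pending = 8                     # limit on tasks queued or running at once
    gate = BoundedSemaphore(max_pending)
    with ProcessPoolExecutor(max_workers=4) as executor:
        for item in range(100):
            gate.acquire()              # blocks while too many tasks are pending
            future = executor.submit(work, item)
            future.add_done_callback(lambda _: gate.release())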
Just in case someone ends up here, this is how I solved the problem: I stopped using multiprocessing.Pool. Here is how I do it now:
import multiprocessing
import sys

# number of concurrent processes that insert db data
processes = multiprocessing.cpu_count() * 2

# bounded batch queue: put() blocks once it holds `processes * 2` batches
queue = multiprocessing.Queue(processes * 2)

# start the worker processes
for _ in range(processes):
    multiprocessing.Process(target=insert, args=(queue,)).start()

# fill the queue with batches
batch = []
for i, content in enumerate(sys.stdin):
    batch.append(content)
    if len(batch) >= 10000:
        queue.put((batch, i + 1))
        batch = []
if batch:
    queue.put((batch, i + 1))

# stop the workers using one poison pill per process
for _ in range(processes):
    queue.put((None, None))

print "all done."
In the insert method, the processing of each batch is wrapped in a loop that pulls from the queue until it receives the poison pill:
while True:
    batch, end = queue.get()
    if not batch and not end:
        print 'worker done.'   # poison pill received, this worker is finished
        return
    # [process the batch]
Not pretty, but you can access the internal queue size and wait until it's below your maximum desired size before adding new items:
import time

max_pool_queue_size = 20
for i in range(10000):
    pool.apply_async(some_func, args=(...))
    # poll the (private) internal task queue until it drains below the limit
    while pool._taskqueue.qsize() > max_pool_queue_size:
        time.sleep(1)