I have a strange problem. I have a file of the format:
START
1
2
STOP
lllllllll
START
3
5
6
STOP
and I want to read the lines between each START/STOP pair.
How about:
import itertools
import multiprocessing

def grouper(n, iterable, fillvalue=None):
    # Source: http://docs.python.org/library/itertools.html#recipes
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    return itertools.izip_longest(*[iter(iterable)]*n, fillvalue=fillvalue)

def block_generator(filename):
    with open(filename) as lines:
        for line in lines:
            # Lines read from a file keep their trailing newline,
            # so compare against the stripped line.
            if line.strip() == 'START':
                block = list(itertools.takewhile(
                    lambda x: x.strip() != 'STOP', lines))
                yield block

blocks = block_generator(filename)   # filename is the path to your data file
p = multiprocessing.Pool(4)
for chunk in grouper(100, blocks, fillvalue=''):
    p.map(my_f, chunk)               # my_f is your per-block worker function
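For example, assuming the sample input above is saved in a file called data.txt (a name chosen just for illustration), block_generator yields one list of raw lines per START/STOP pair:

>>> list(block_generator('data.txt'))
[['1\n', '2\n'], ['3\n', '5\n', '6\n']]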
Using grouper limits how much of the blocks iterator each call to p.map consumes (here, at most 100 blocks per call). Thus the whole file need not be read into memory (fed into the task queue) at once.
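To make the chunking concrete, here is the behavior the docstring describes, run at an interactive prompt:

>>> list(grouper(3, 'ABCDEFG', fillvalue='x'))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]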
I claim above that when you call p.map(func, iterator), the entire iterator is consumed immediately to fill a task queue. The pool workers then get tasks from the queue and work on the jobs concurrently.
If you look inside pool.py and trace through the definitions, you will see that the _handle_tasks thread gets items from self._taskqueue and enumerates them all at once:
for i, task in enumerate(taskseq):
    ...
    put(task)
The conclusion is that the iterator passed to p.map gets consumed all at once; there is no waiting for one task to finish before the next item is pulled from the iterator.
As further corroboration, if you run this demonstration code:
import multiprocessing as mp
import time
import logging

def foo(x):
    time.sleep(1)
    return x*x

def blocks():
    # Log a message every 100 items so we can see when the
    # iterator is being consumed.
    for x in range(1000):
        if x % 100 == 0:
            logger.info('Got here')
        yield x

logger = mp.log_to_stderr(logging.DEBUG)
logger.setLevel(logging.DEBUG)
pool = mp.Pool()
print pool.map(foo, blocks())
you will see the Got here message printed 10 times almost immediately, followed by a long pause due to the time.sleep(1) call in foo. This manifestly shows the iterator is fully consumed long before the pool processes get around to finishing the tasks.
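By contrast, here is a minimal sketch (reusing foo, blocks, pool, and the grouper recipe from above) of the same demonstration with the iterator fed through grouper. Each Got here message should now appear interleaved with the workers' progress, since each p.map call consumes only 100 items:

for chunk in grouper(100, blocks(), fillvalue=None):
    # Drop the padding grouper adds to the final chunk before mapping.
    chunk = [x for x in chunk if x is not None]
    print pool.map(foo, chunk)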