As the title suggests, I'm working on a site written in Python and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.
Not sure why nobody mentions multiprocessing (if anyone knows why this might be a bad idea, let me know):
import multiprocessing
from urllib2 import urlopen

URLS = [....]

def get_content(url):
    return urlopen(url).read()

pool = multiprocessing.Pool(processes=8)  # play with ``processes`` for best results
results = pool.map(get_content, URLS)     # this line blocks; see map_async
                                          # for a non-blocking map() call
pool.close()  # the pool no longer accepts new tasks
pool.join()   # blocks until all worker processes have finished

for result in results:
    # do something with each downloaded page
    pass
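For reference, here is a minimal sketch of the non-blocking variant mentioned in the comment above. It assumes the same get_content function; the URL list is just a placeholder.

import multiprocessing
from urllib2 import urlopen

URLS = ["http://example.com/"]  # placeholder; use your own list

def get_content(url):
    return urlopen(url).read()

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=8)
    async_result = pool.map_async(get_content, URLS)  # returns immediately
    # ... do other work here while the workers fetch the pages ...
    pool.close()
    results = async_result.get()  # blocks until all results are ready
    pool.join()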
There are a few caveats with multiprocessing pools. First, unlike threads, the workers are entirely separate Python processes, each with its own interpreter. While that means they are not subject to the global interpreter lock (GIL), it also means you are limited in what you can pass across to them: everything has to be picklable.
You cannot pass lambdas or dynamically defined functions. The function used in the map() call must be defined at module level so that the worker process can import it.
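To illustrate, here is a toy sketch (unrelated to the URL-fetching code): a function defined at module level can be pickled and shipped to the workers, while a lambda cannot.

import multiprocessing

def square(x):  # defined at module level: picklable, works with Pool.map
    return x * x

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    print pool.map(square, [1, 2, 3])        # [1, 4, 9]
    # pool.map(lambda x: x * x, [1, 2, 3])   # fails: lambdas cannot be pickled
    pool.close()
    pool.join()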
Pool.map(), the most straightforward way to process multiple tasks concurrently, doesn't provide a way to pass multiple arguments to the worker function, so you may need to write a wrapper, change the function's signature, or pack the arguments into the items of the iterable being mapped.
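One common workaround is a small wrapper that unpacks a tuple of arguments. A sketch, where the timeout parameter, the wrapper name, and the URL list are just for illustration:

import multiprocessing
from urllib2 import urlopen

def get_content(url, timeout):
    return urlopen(url, timeout=timeout).read()

def get_content_wrapper(args):
    # Pool.map() passes one item from the iterable, so bundle the
    # arguments into a tuple and unpack them here.
    return get_content(*args)

if __name__ == "__main__":
    URLS = ["http://example.com/"]       # placeholder list
    tasks = [(url, 10) for url in URLS]  # (url, timeout) pairs
    pool = multiprocessing.Pool(processes=8)
    results = pool.map(get_content_wrapper, tasks)
    pool.close()
    pool.join()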
Pool workers are daemonic processes and cannot spawn children of their own; only the parent process can. This means you have to plan and benchmark carefully (and sometimes write multiple versions of your code) to determine the most effective way to divide the work among processes.
Drawbacks notwithstanding, I find multiprocessing to be one of the most straightforward ways to run blocking calls concurrently. You can also combine multiprocessing with threads (afaik, but please correct me if I'm wrong), or with green threads.
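For what it's worth, the standard library also ships multiprocessing.dummy, which exposes the same Pool API on top of threads. Since fetching URLs is I/O-bound (the GIL is released while waiting on the network), a thread-backed pool is often enough. A minimal sketch, with a placeholder URL list:

from multiprocessing.dummy import Pool  # same Pool API, backed by threads
from urllib2 import urlopen

URLS = ["http://example.com/"]  # placeholder; use your own list

def get_content(url):
    return urlopen(url).read()

pool = Pool(8)  # 8 threads, not processes
results = pool.map(get_content, URLS)
pool.close()
pool.join()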