As the title suggests, I'm working on a site written in Python, and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.
Not sure why nobody mentions multiprocessing (if anyone knows why this might be a bad idea, let me know):
import multiprocessing
from urllib2 import urlopen
URLS = [....]
def get_content(url):
    return urlopen(url).read()
pool = multiprocessing.Pool(processes=8) # play with ``processes`` for best results
results = pool.map(get_content, URLS) # This line blocks, look at map_async
# for non-blocking map() call
pool.close() # the process pool no longer accepts new tasks
pool.join() # join the processes: this blocks until all URLs are processed
for result in results:
    pass  # do something with each page's content here
There are a few caveats with multiprocessing pools. First, unlike threads, these are completely new Python processes, each with its own interpreter. While they are not subject to the global interpreter lock, this means you are limited in what you can pass across to the new process.
You cannot pass lambdas or functions that are defined dynamically. The function used in the map() call must be defined at the top level of your module so that the worker process can import it.
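A minimal sketch of that restriction, with a trivial square() function standing in for real work (the names here are made up for illustration):
import multiprocessing

def square(x):
    # defined at module top level, so worker processes can import it
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    print(pool.map(square, [1, 2, 3]))       # works: [1, 4, 9]
    # pool.map(lambda x: x * x, [1, 2, 3])   # raises PicklingError: lambdas
    #                                        # cannot be pickled for the workers
    pool.close()
    pool.join()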
Pool.map(), which is the most straightforward way to process multiple tasks concurrently, doesn't provide a way to pass multiple arguments, so you may need to write wrapper functions or change function signatures, and/or pass multiple arguments as part of the iterable that is being mapped.
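A minimal sketch of the tuple-unpacking workaround; the fetch_with_timeout name and the timeout value are made up for illustration:
import multiprocessing
from urllib2 import urlopen

def fetch_with_timeout(args):
    # map() delivers one item per call, so bundle the real arguments
    # into a tuple and unpack them here
    url, timeout = args
    return urlopen(url, timeout=timeout).read()

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    tasks = [('http://example.com/', 10), ('http://example.org/', 10)]
    pages = pool.map(fetch_with_timeout, tasks)
    pool.close()
    pool.join()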
You cannot have worker processes spawn new ones: the pool's workers are daemonic, so only the parent can spawn child processes. This means you have to carefully plan and benchmark (and sometimes write multiple versions of your code) in order to determine the most effective use of processes.
Drawbacks notwithstanding, I find multiprocessing to be one of the most straightforward ways to make blocking calls concurrently. You can also combine multiprocessing with threads (AFAIK, but please correct me if I'm wrong), or combine multiprocessing with green threads.
1) Are you opening the same site many times, or many different sites? If many different sites, I think urllib2 is good. If hitting the same site over and over again, I have had some personal luck with urllib3: http://code.google.com/p/urllib3/
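A minimal sketch of that approach, assuming urllib3 is installed and using a made-up example.com host:
import urllib3

# a single PoolManager keeps TCP connections to each host alive, so
# repeated requests to the same site skip the reconnect overhead
http = urllib3.PoolManager()
for path in ('/page1', '/page2', '/page3'):
    response = http.request('GET', 'http://example.com' + path)
    html = response.data  # raw body, ready to hand to BeautifulSoup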
2) BeautifulSoup is easy to use, but is pretty slow. If you do have to use it, make sure to call decompose() on your tags when you are done with them, or the parse trees will likely lead to memory issues (they did for me).
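A minimal sketch of that cleanup, assuming BeautifulSoup 3 (the import differs under bs4) and a made-up parse_title() helper:
from BeautifulSoup import BeautifulSoup

def parse_title(html):
    soup = BeautifulSoup(html)
    title = soup.find('title').string
    # decompose() tears down the parse tree and breaks its circular
    # references so the garbage collector can actually reclaim it
    soup.decompose()
    return title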
What do your memory and CPU usage look like? If you are maxing out your CPU, make sure you are using real heavyweight threads (i.e. processes), so you can run on more than one core.
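A minimal sketch of sizing a pool to the machine, reusing the multiprocessing approach above:
import multiprocessing

# threads share one GIL, so CPU-bound parsing gains nothing from more
# threads; a process pool sized to the core count uses every core
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())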
I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.
import threading, urllib2
import Queue
urls_to_load = [
'http://stackoverflow.com/',
'http://slashdot.org/',
'http://www.archive.org/',
'http://www.yahoo.co.jp/',
]
def read_url(url, queue):
    data = urllib2.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequential():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
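For completeness, a minimal sketch of consuming the returned queue (safe here because fetch_parallel() joins every thread before returning):
if __name__ == '__main__':
    results = fetch_parallel()
    while not results.empty():
        page = results.get()
        print(len(page))  # parse each page's HTML here instead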
Best time for fetch_sequential() is 2 s; best time for fetch_parallel() is 0.9 s.
It is also incorrect to say threading is useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because they are blocked on I/O. As you can see from my result, the parallel case is about 2 times faster.