I have a web.py server that responds to various user requests. One of these requests involves downloading and analyzing a series of web pages.
Is there a simple way to download and analyze those pages concurrently (asynchronously), instead of fetching them one at a time inside the request handler?
Actually, you can integrate Twisted with web.py. I'm not really sure how, as I've only done it with Django (used Twisted with it).
I'd just build a service in Twisted that does the concurrent fetch and analysis, and access it from web.py as a simple HTTP request.
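The concurrent-fetch part of such a service could look roughly like this (only a sketch: the URLs are placeholders and the analysis step is left out), using Twisted's getPage and DeferredList:

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def fetch_all(urls):
    # Start every download at once; the DeferredList fires when all have finished
    deferreds = [getPage(url) for url in urls]
    return defer.DeferredList(deferreds, consumeErrors=True)

def done(results):
    # results is a list of (success, page_body_or_Failure) tuples
    for success, value in results:
        if success:
            print len(value), 'bytes downloaded'
        else:
            print 'failed:', value.getErrorMessage()
    reactor.stop()

# Placeholder URLs, just to show the call
fetch_all(['http://www.google.com/', 'http://www.example.com/']).addCallback(done)
reactor.run()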
You might be able to use urllib to download the files and the Queue module to manage a number of worker threads, e.g.:
import urllib
from threading import Thread
from Queue import Queue

NUM_WORKERS = 20

class Dnld:
    def __init__(self):
        self.Q = Queue()
        for i in xrange(NUM_WORKERS):
            t = Thread(target=self.worker)
            t.setDaemon(True)
            t.start()

    def worker(self):
        while 1:
            url, Q = self.Q.get()
            try:
                f = urllib.urlopen(url)
                Q.put(('ok', url, f.read()))
                f.close()
            except Exception, e:
                Q.put(('error', url, e))
                try: f.close() # clean up
                except: pass

    def download_urls(self, L):
        Q = Queue() # Create a second queue so the worker
                    # threads can send the data back again
        for url in L:
            # Add the URLs in `L` to be downloaded asynchronously
            self.Q.put((url, Q))

        rtn = []
        for i in xrange(len(L)):
            # Get the data as it arrives, raising
            # any exceptions if they occur
            status, url, data = Q.get()
            if status == 'ok':
                rtn.append((url, data))
            else:
                raise data
        return rtn

inst = Dnld()
for url, data in inst.download_urls(['http://www.google.com']*2):
    print url, data
I don't know if this will exactly work, but it looks like it might: EvServer: Python Asynchronous WSGI Server has a web.py interface and can do comet-style push to the browser client.
If that isn't right, maybe you can use the Concurrence HTTP client for async download of the pages and figure out how to serve them to the browser via ajax or comet.
Use the async HTTP client, which uses asynchat and asyncore: http://sourceforge.net/projects/asynchttp/files/asynchttp-production/asynchttp.py-1.0/asynchttp.py/download
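If you just want to see the underlying pattern, here is a bare-bones sketch (not asynchttp itself, just plain asyncore from the standard library) that fetches several pages in one event loop; the URLs are placeholders:

import asyncore, socket, urlparse

class HTTPGet(asyncore.dispatcher):
    def __init__(self, url):
        asyncore.dispatcher.__init__(self)
        parts = urlparse.urlparse(url)
        self.host = parts.hostname
        self.request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (parts.path or '/', self.host)
        self.response = ''
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((self.host, parts.port or 80))

    def handle_connect(self):
        pass

    def writable(self):
        # Only ask for write events while there is request data left to send
        return len(self.request) > 0

    def handle_write(self):
        sent = self.send(self.request)
        self.request = self.request[sent:]

    def handle_read(self):
        self.response += self.recv(8192)

    def handle_close(self):
        self.close()
        print self.host, len(self.response), 'bytes (headers included)'

# All requests share one event loop, so they are downloaded concurrently
for u in ['http://www.google.com/', 'http://www.example.com/']:
    HTTPGet(u)
asyncore.loop()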
Here is an interesting piece of code. I didn't use it myself, but it looks nice ;)
https://github.com/facebook/tornado/blob/master/tornado/httpclient.py
Low level AsyncHTTPClient:
"An non-blocking HTTP client backed with pycurl. Example usage:"
from tornado import httpclient, ioloop

def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body
    ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
http_client.fetch("http://www.google.com/", handle_request)
ioloop.IOLoop.instance().start()
" fetch() can take a string URL or an HTTPRequest instance, which offers more options, like executing POST/PUT/DELETE requests.
The keyword argument max_clients to the AsyncHTTPClient constructor determines the maximum number of simultaneous fetch() operations that can execute in parallel on each IOLoop. "
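To illustrate those two options, here is a small sketch (the URL and body are made up) that passes max_clients to the constructor and sends a POST via an HTTPRequest:

from tornado import httpclient, ioloop

def handle(response):
    # HTTPResponse exposes .error, .code and .body as in the example above
    print response.code, len(response.body or '')
    ioloop.IOLoop.instance().stop()

# max_clients caps concurrent fetches, as described in the quoted docstring
client = httpclient.AsyncHTTPClient(max_clients=20)

# An HTTPRequest lets you set the method, body, headers, timeouts, etc.
request = httpclient.HTTPRequest("http://www.example.com/submit",  # made-up URL
                                 method="POST", body="q=test")
client.fetch(request, handle)
ioloop.IOLoop.instance().start()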
There is also a new implementation in progress: https://github.com/facebook/tornado/blob/master/tornado/simple_httpclient.py ("Non-blocking HTTP client with no external dependencies. ... This class is still in development and not yet recommended for production use.")