I have a web.py server that responds to various user requests. One of these requests involves downloading and analyzing a series of web pages.
Is there a simple way to download and analyze those pages concurrently (asynchronously), instead of fetching them one at a time inside the request handler?
Actually, you can integrate Twisted with web.py. I'm not really sure how, as I've only done it with Django (used Twisted with it).
I'd just build a service in Twisted that does the concurrent fetch and analysis, and access it from web.py as a simple HTTP request.
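The concurrent-fetch part of such a service could look roughly like this (only a sketch: the URLs are placeholders and the analysis step is left out), using Twisted's getPage and DeferredList:

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def fetch_all(urls):
    # Start every download at once; the DeferredList fires when all have finished
    deferreds = [getPage(url) for url in urls]
    return defer.DeferredList(deferreds, consumeErrors=True)

def done(results):
    # results is a list of (success, page_body_or_Failure) tuples
    for success, value in results:
        if success:
            print len(value), 'bytes downloaded'
        else:
            print 'failed:', value.getErrorMessage()
    reactor.stop()

# Placeholder URLs, just to show the call
fetch_all(['http://www.google.com/', 'http://www.example.com/']).addCallback(done)
reactor.run()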
You might be able to use urllib to download the files and the Queue module to manage a number of worker threads, e.g.:
import urllib
from threading import Thread
from Queue import Queue

NUM_WORKERS = 20

class Dnld:
    def __init__(self):
        self.Q = Queue()
        for i in xrange(NUM_WORKERS):
            t = Thread(target=self.worker)
            t.setDaemon(True)
            t.start()

    def worker(self):
        while 1:
            url, Q = self.Q.get()
            try:
                f = urllib.urlopen(url)
                Q.put(('ok', url, f.read()))
                f.close()
            except Exception, e:
                Q.put(('error', url, e))
                try: f.close() # clean up
                except: pass

    def download_urls(self, L):
        Q = Queue() # Create a second queue so the worker
                    # threads can send the data back again
        for url in L:
            # Add the URLs in `L` to be downloaded asynchronously
            self.Q.put((url, Q))

        rtn = []
        for i in xrange(len(L)):
            # Get the data as it arrives, raising
            # any exceptions if they occur
            status, url, data = Q.get()
            if status == 'ok':
                rtn.append((url, data))
            else:
                raise data
        return rtn

inst = Dnld()
for url, data in inst.download_urls(['http://www.google.com']*2):
    print url, data
I don't know if this will exactly work, but it looks like it might: EvServer: Python Asynchronous WSGI Server has a web.py interface and can do comet-style push to the browser client.
If that isn't right, maybe you can use the Concurrence HTTP client for async download of the pages and figure out how to serve them to the browser via ajax or comet.
Use the async HTTP client, which uses asynchat and asyncore: http://sourceforge.net/projects/asynchttp/files/asynchttp-production/asynchttp.py-1.0/asynchttp.py/download
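If you just want to see the underlying pattern, here is a bare-bones sketch (not asynchttp itself, just plain asyncore from the standard library) that fetches several pages in one event loop; the URLs are placeholders:

import asyncore, socket, urlparse

class HTTPGet(asyncore.dispatcher):
    def __init__(self, url):
        asyncore.dispatcher.__init__(self)
        parts = urlparse.urlparse(url)
        self.host = parts.hostname
        self.request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (parts.path or '/', self.host)
        self.response = ''
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((self.host, parts.port or 80))

    def handle_connect(self):
        pass

    def writable(self):
        # Only ask for write events while there is request data left to send
        return len(self.request) > 0

    def handle_write(self):
        sent = self.send(self.request)
        self.request = self.request[sent:]

    def handle_read(self):
        self.response += self.recv(8192)

    def handle_close(self):
        self.close()
        print self.host, len(self.response), 'bytes (headers included)'

# All requests share one event loop, so they are downloaded concurrently
for u in ['http://www.google.com/', 'http://www.example.com/']:
    HTTPGet(u)
asyncore.loop()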
Here is an interesting piece of code. I didn't use it myself, but it looks nice ;)
https://github.com/facebook/tornado/blob/master/tornado/httpclient.py
Low level AsyncHTTPClient:
"An non-blocking HTTP client backed with pycurl. Example usage:"
from tornado import httpclient, ioloop

def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body
    ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
http_client.fetch("http://www.google.com/", handle_request)
ioloop.IOLoop.instance().start()
" fetch() can take a string URL or an HTTPRequest instance, which offers more options, like executing POST/PUT/DELETE requests.
The keyword argument max_clients to the AsyncHTTPClient constructor determines the maximum number of simultaneous fetch() operations that can execute in parallel on each IOLoop. "
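To illustrate those two options, here is a small sketch (the URL and body are made up) that passes max_clients to the constructor and sends a POST via an HTTPRequest:

from tornado import httpclient, ioloop

def handle(response):
    # HTTPResponse exposes .error, .code and .body as in the example above
    print response.code, len(response.body or '')
    ioloop.IOLoop.instance().stop()

# max_clients caps concurrent fetches, as described in the quoted docstring
client = httpclient.AsyncHTTPClient(max_clients=20)

# An HTTPRequest lets you set the method, body, headers, timeouts, etc.
request = httpclient.HTTPRequest("http://www.example.com/submit",  # made-up URL
                                 method="POST", body="q=test")
client.fetch(request, handle)
ioloop.IOLoop.instance().start()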
There is also a new implementation in progress: https://github.com/facebook/tornado/blob/master/tornado/simple_httpclient.py ("Non-blocking HTTP client with no external dependencies. ... This class is still in development and not yet recommended for production use.")