Question
I'm not sure if this is possible; I spent some time looking at what seem like similar questions, but it's still unclear. For a list of website URLs, I need to get the HTML as a starting point.
1 - I have a class that contains a list of these URLs, and the class returns a custom iterator that helps me iterate through them to get the HTML (simplified below):
class Url:
    def __init__(self, url):
        self.url = url

    def fetchhtml(self):
        import urllib2
        response = urllib2.urlopen(self.url)
        return response.read()
class MyIterator:
    def __init__(self, obj):
        self.obj = obj
        self.cnt = 0

    def __iter__(self):
        return self

    def next(self):
        try:
            result = self.obj.get(self.cnt)
            self.cnt += 1
            return result
        except IndexError:
            raise StopIteration
class Urls:
    def __init__(self, url_list=[]):
        self.list = url_list

    def __iter__(self):
        return MyIterator(self)

    def get(self, index):
        return self.list[index]
2 - I want to be able to use it like this:
url_list = [url1, url2, url3]
urls = Urls(url_list)
html_image_list = {url.url: re.search('@src="([^"]+)"', url.fetchhtml()) for url in urls}
3 - The problem I have is that I want to batch all the requests rather than having fetchhtml run sequentially over my list, and only extract the image list once they are all done.
Is there a way to achieve this, maybe using threads and a queue? I cannot see how to make the comprehension for my object work like this without it running sequentially. Maybe this is the wrong approach, but I just want to batch long-running requests initiated by operations within a list or dict comprehension. Thank you in advance.
Answer 1:
You need to use threading or multiprocessing. Also, in Python 3, there is concurrent.futures. Take a look at ThreadPoolExecutor and ProcessPoolExecutor.
The example in the docs for ThreadPoolExecutor does almost exactly what you are asking:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
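To tie this back to your own classes, you can pre-fetch every page with the executor and only build the image dict once all the downloads have finished. A minimal sketch, assuming Python 3 and a fetchhtml() ported from urllib2 to urllib.request (the pool size of 5 and the example URLs are arbitrary):

import concurrent.futures
import re

url_list = [Url('http://www.example.com/'), Url('http://www.example.org/')]

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # submit() starts each download immediately, so they run in parallel
    future_to_url = {executor.submit(u.fetchhtml): u for u in url_list}
    html_by_url = {}
    for future in concurrent.futures.as_completed(future_to_url):
        u = future_to_url[future]
        # result() re-raises any exception raised inside fetchhtml()
        html_by_url[u.url] = future.result()

# every page has been fetched by this point, so the comprehension itself
# stays sequential and simple
html_image_list = {url: re.search('@src="([^"]+)"', html)
                   for url, html in html_by_url.items()}

The comprehension no longer triggers the network calls itself; it only walks over results that were already fetched concurrently.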
- Note: similar functionality is available for Python 2 via the futures package on PyPI.
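If you would rather stay on Python 2 without adding that dependency, the threads-and-queue approach you mentioned also works. A rough sketch, reusing your Url class (the worker count and the None sentinel are arbitrary choices, not part of your code):

import threading
import Queue  # named 'queue' in Python 3

def worker(in_q, out_q):
    # each worker pulls Url objects until it sees the None sentinel
    while True:
        url_obj = in_q.get()
        if url_obj is None:
            break
        out_q.put((url_obj.url, url_obj.fetchhtml()))

in_q = Queue.Queue()
out_q = Queue.Queue()
threads = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(5)]
for t in threads:
    t.start()

for url_obj in url_list:   # your list of Url objects
    in_q.put(url_obj)
for _ in threads:          # one sentinel per worker so every thread exits
    in_q.put(None)
for t in threads:
    t.join()

# all requests are done; drain the results into a dict keyed by URL
html_by_url = {}
while not out_q.empty():
    url, html = out_q.get()
    html_by_url[url] = html

concurrent.futures hides most of this bookkeeping for you, which is why the futures backport is usually the nicer option.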
Source: https://stackoverflow.com/questions/16760980/how-to-batch-asynchronous-web-requests-performed-using-a-comprehension-in-python