How to batch asynchronous web requests performed using a comprehension in Python?

拟墨画扇 submitted on 2019-12-12 00:51:41

Question


Not sure if this is possible; I've spent some time looking at what seem like similar questions, but it is still unclear to me. For a list of website URLs, I need to get the HTML as a starting point.

I have a class that contains a list of these URLs, and the class returns a custom iterator that helps me iterate through them to get the HTML (simplified below):

class Url:
   def __init__(self, url):
      self.url = url

   def fetchhtml(self):
      import urllib2  # Python 2; urllib.request is the Python 3 equivalent
      response = urllib2.urlopen(self.url)
      return response.read()

class MyIterator:
   def __init__(self, obj):
       self.obj=obj
       self.cnt=0

   def __iter__(self):
       return self

   def next(self):
       try:
           result=self.obj.get(self.cnt)
           self.cnt+=1
           return result
       except IndexError:
           raise StopIteration  

class Urls:
   def __init__(self, url_list = []):
       self.list = url_list

   def __iter__(self):
       return MyIterator(self)

   def get(self, index):
       return self.list[index]

2 - I want to be able to use it like this:

url_list = [url1, url2, url3]
urls = Urls(url_list)
html_image_list = {url.url: re.search('@src="([^"]+)"', url.fetchhtml()) for url in urls}

3 - The problem I have is that I want to batch all the requests rather than having fetchhtml operate sequentially on my list, and only extract the image list once they have all completed.

Is there a way to achieve this, maybe using threads and a queue? I cannot see how to make the list comprehension for my object work like this without it running sequentially. Maybe this is the wrong way, but I just want to batch long-running requests initiated by operations within a list or dict comprehension. Thank you in advance.
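
For illustration only, here is a minimal sketch of the threads-and-queue idea mentioned above. It assumes Python 3's threading and queue modules; fetch_html, worker and fetch_all are hypothetical names standing in for Url.fetchhtml and the batching logic, and are not part of the code above:

import queue
import threading
import urllib.request

def fetch_html(url):
    # Hypothetical helper playing the role of Url.fetchhtml
    with urllib.request.urlopen(url, timeout=60) as conn:
        return conn.read()

def worker(task_queue, results):
    # Each worker pulls URLs off the queue until it sees the None sentinel
    while True:
        url = task_queue.get()
        if url is None:
            task_queue.task_done()
            break
        try:
            results[url] = fetch_html(url)
        except Exception as exc:
            results[url] = exc   # keep the exception so failed URLs are visible
        task_queue.task_done()

def fetch_all(url_strings, num_threads=5):
    task_queue = queue.Queue()
    results = {}
    threads = [threading.Thread(target=worker, args=(task_queue, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in url_strings:
        task_queue.put(url)
    for _ in threads:
        task_queue.put(None)   # one sentinel per worker so every thread exits
    task_queue.join()
    return results

A call such as fetch_all(['http://www.bbc.co.uk/', 'http://www.cnn.com/']) would then return a url-to-HTML dict that a comprehension can post-process without any further waiting.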


回答1:


You need to use threading or multiprocessing.

Also, in Python 3 there is concurrent.futures. Take a look at ThreadPoolExecutor and ProcessPoolExecutor.

The example in the docs for ThreadPoolExecutor does almost exactly what you are asking:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

Note: similar functionality is available for Python 2 via the futures package on PyPI.
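
Tying this back to the question, a minimal sketch (assuming the Url and Urls classes from the question, ported to Python 3 so that fetchhtml uses urllib.request and the iterator defines __next__, and using an illustrative src regex) could submit fetchhtml for every URL and build the image dict as the futures complete:

import concurrent.futures
import re

# Assumes Url and Urls from the question, ported to Python 3
urls = Urls([Url('http://www.bbc.co.uk/'), Url('http://www.cnn.com/')])

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Submit every fetch first so the downloads run in parallel...
    future_to_url = {executor.submit(url.fetchhtml): url for url in urls}
    # ...then build the dict as each page finishes; the regex is illustrative
    html_image_list = {
        future_to_url[future].url:
            re.search('src="([^"]+)"', future.result().decode('utf-8', 'ignore'))
        for future in concurrent.futures.as_completed(future_to_url)
    }

Note that future.result() re-raises any exception from the worker thread, so a real version would wrap it in try/except as in the docs example above.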


Source: https://stackoverflow.com/questions/16760980/how-to-batch-asynchronous-web-requests-performed-using-a-comprehension-in-python
