Question
I'm not sure if this is possible; I spent some time looking at what seem like similar questions, but it's still unclear. For a list of website URLs, I need to get the HTML as a starting point.
1 - I have a class that contains a list of these URLs, and the class returns a custom iterator that helps me iterate through them to get the HTML (simplified below):
class Url:
    def __init__(self, url):
        self.url = url

    def fetchhtml(self):
        import urllib2
        response = urllib2.urlopen(self.url)
        return response.read()
class MyIterator:
    def __init__(self, obj):
        self.obj = obj
        self.cnt = 0

    def __iter__(self):
        return self

    def next(self):
        try:
            result = self.obj.get(self.cnt)
            self.cnt += 1
            return result
        except IndexError:
            raise StopIteration
class Urls:
    def __init__(self, url_list=[]):
        self.list = url_list

    def __iter__(self):
        return MyIterator(self)

    def get(self, index):
        return self.list[index]
2 - I want to be able to use it like this:
url_list = [url1, url2, url3]
urls = Urls(url_list)
html_image_list = {url.url: re.search('@src="([^"]+)"', url.fetchhtml()) for url in urls}
3 - The problem I have is that I want to batch all the requests rather than having fetchhtml run sequentially over my list, and only extract the image list once they are all done.
Is there a way to achieve this, maybe using threads and a queue? I cannot see how to make the comprehension for my object work like this without it running sequentially. Maybe this is the wrong approach, but I just want to batch long-running requests initiated by operations within a list or dict comprehension. Thank you in advance.
Answer 1:
You need to use threading or multiprocessing. Also, in Python 3, there is concurrent.futures. Take a look at ThreadPoolExecutor and ProcessPoolExecutor.
The example in the docs for ThreadPoolExecutor does almost exactly what you are asking:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
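To tie this back to your own classes, you can pre-fetch every page with the executor and only build the image dict once all the downloads have finished. A minimal sketch, assuming Python 3 and a fetchhtml() ported from urllib2 to urllib.request (the pool size of 5 and the example URLs are arbitrary):

import concurrent.futures
import re

url_list = [Url('http://www.example.com/'), Url('http://www.example.org/')]

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # submit() starts each download immediately, so they run in parallel
    future_to_url = {executor.submit(u.fetchhtml): u for u in url_list}
    html_by_url = {}
    for future in concurrent.futures.as_completed(future_to_url):
        u = future_to_url[future]
        # result() re-raises any exception raised inside fetchhtml()
        html_by_url[u.url] = future.result()

# every page has been fetched by this point, so the comprehension itself
# stays sequential and simple
html_image_list = {url: re.search('@src="([^"]+)"', html)
                   for url, html in html_by_url.items()}

The comprehension no longer triggers the network calls itself; it only walks over results that were already fetched concurrently.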
- Note: similar functionality is available for Python 2 via the futures package on PyPI.
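If you would rather stay on Python 2 without adding that dependency, the threads-and-queue approach you mentioned also works. A rough sketch, reusing your Url class (the worker count and the None sentinel are arbitrary choices, not part of your code):

import threading
import Queue  # named 'queue' in Python 3

def worker(in_q, out_q):
    # each worker pulls Url objects until it sees the None sentinel
    while True:
        url_obj = in_q.get()
        if url_obj is None:
            break
        out_q.put((url_obj.url, url_obj.fetchhtml()))

in_q = Queue.Queue()
out_q = Queue.Queue()
threads = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(5)]
for t in threads:
    t.start()

for url_obj in url_list:   # your list of Url objects
    in_q.put(url_obj)
for _ in threads:          # one sentinel per worker so every thread exits
    in_q.put(None)
for t in threads:
    t.join()

# all requests are done; drain the results into a dict keyed by URL
html_by_url = {}
while not out_q.empty():
    url, html = out_q.get()
    html_by_url[url] = html

concurrent.futures hides most of this bookkeeping for you, which is why the futures backport is usually the nicer option.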
Source: https://stackoverflow.com/questions/16760980/how-to-batch-asynchronous-web-requests-performed-using-a-comprehension-in-python