I have a script that fetches several web pages and parses the info.
(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
Ray offers an elegant way to do this. Ray is a library for writing parallel and distributed Python. (The example below keeps a Python 2 fallback, but recent Ray releases support only Python 3.)
Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).
import ray
import sys

ray.init()

@ray.remote
def fetch(url):
    if sys.version_info >= (3, 0):
        import urllib.request
        return urllib.request.urlopen(url).read()
    else:
        import urllib2
        return urllib2.urlopen(url).read()

urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
        'https://en.wikipedia.org/wiki/Barack_Obama',
        'https://en.wikipedia.org/wiki/George_W._Bush',
        'https://en.wikipedia.org/wiki/Bill_Clinton',
        'https://en.wikipedia.org/wiki/George_H._W._Bush']

# Fetch the webpages in parallel.
results = ray.get([fetch.remote(url) for url in urls])
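Because ray.get on a list returns the values in the same order as the object references passed to it, the fetched pages can be matched back to their URLs by position. The following is only a small sketch of that idea: page_sizes is a hypothetical helper name, and the example.com URLs and byte strings are made-up stand-ins so that the snippet runs without Ray or a network connection.

```python
def page_sizes(urls, pages):
    """Pair each URL with the size in bytes of the page fetched for it."""
    if len(urls) != len(pages):
        raise ValueError("expected one page per URL")
    return [(url, len(page)) for url, page in zip(urls, pages)]


# With the snippet above this would be page_sizes(urls, results).
# The literal values here are made-up stand-ins so the sketch runs offline.
sizes = page_sizes(
    ["https://example.com/a", "https://example.com/b"],
    [b"<html>a</html>", b"<html>bb</html>"],
)
for url, size in sizes:
    print(url, size)
```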
If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose them together.
@ray.remote
def process(html):
    tokens = html.split()
    return set(tokens)

# Fetch and process the pages in parallel.
results = []
for url in urls:
    results.append(process.remote(fetch.remote(url)))
results = ray.get(results)
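After the final ray.get, results is an ordinary Python list holding one set of tokens per URL, so any further aggregation is plain Python. As one possible illustration (shared_tokens is a hypothetical helper, and the token sets below are invented stand-ins for real page contents), this sketch finds the tokens common to every page:

```python
def shared_tokens(token_sets):
    """Return the tokens that appear in every page's token set."""
    token_sets = list(token_sets)
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


# With the snippet above this would be shared_tokens(results).
# Tokens are bytes because urlopen(...).read() returns bytes in Python 3.
example = [
    {b"the", b"president", b"born"},
    {b"the", b"president"},
    {b"the", b"senate"},
]
common = shared_tokens(example)
print(common)
```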
If you have a very long list of URLs that you want to fetch, you may wish to issue some tasks and then process them in the order that they complete. You can do this using ray.wait.
urls = 100 * urls  # Pretend we have a long list of URLs.
results = []
in_progress_ids = []

# Start pulling 10 URLs in parallel.
for _ in range(10):
    url = urls.pop()
    in_progress_ids.append(fetch.remote(url))

# Whenever one finishes, start fetching a new one.
while len(in_progress_ids) > 0:
    # Get a result that has finished.
    [ready_id], in_progress_ids = ray.wait(in_progress_ids)
    results.append(ray.get(ready_id))
    # Start a new task.
    if len(urls) > 0:
        in_progress_ids.append(fetch.remote(urls.pop()))
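To try the same "keep N tasks in flight, refill as each one finishes" pattern without a Ray installation, here is a rough standard-library analogue. Note that this is not Ray: it swaps in concurrent.futures threads, with wait(..., return_when=FIRST_COMPLETED) playing the role of ray.wait. The names fake_fetch, fetch_all and max_in_flight are hypothetical, and fake_fetch only fabricates page bytes instead of downloading anything. Unlike the loop above, it also remembers which URL each result belongs to.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fake_fetch(url):
    # Hypothetical stand-in for a real download; returns bytes like fetch().
    return ("<html>" + url + "</html>").encode("utf-8")


def fetch_all(urls, fetch_fn=fake_fetch, max_in_flight=10):
    """Run fetch_fn over urls with at most max_in_flight tasks pending."""
    pending_urls = list(urls)  # copy so the caller's list is not consumed
    in_flight = {}             # future -> url, so each result keeps its URL
    results = {}               # url -> page (duplicate URLs collapse)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Start the first batch.
        while pending_urls and len(in_flight) < max_in_flight:
            url = pending_urls.pop()
            in_flight[pool.submit(fetch_fn, url)] = url
        # Whenever at least one task finishes, record it and top up.
        while in_flight:
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                results[in_flight.pop(future)] = future.result()
            while pending_urls and len(in_flight) < max_in_flight:
                url = pending_urls.pop()
                in_flight[pool.submit(fetch_fn, url)] = url
    return results


pages = fetch_all(["u%d" % i for i in range(25)], max_in_flight=4)
print(len(pages))
```

Passing a real download function as fetch_fn would turn this into a working thread-based fetcher; for CPU-heavy processing or multiple machines, the Ray version above remains the better fit.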
View the Ray documentation.