How can I speed up fetching pages with urllib2 in Python?

野的像风 2020-11-28 03:28

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
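
The script itself isn't shown here, but a minimal sketch of the kind of serial loop being described might look like this (the URL list and the parse step are placeholders, not the real script):

    import urllib2  # on Python 3 this would be urllib.request

    # Placeholder URL list; the real script has its own.
    urls = ['http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01']

    for url in urls:
        html = urllib2.urlopen(url).read()  # each request blocks until it finishes
        # ... parse html here ...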

11 Answers
  •  执笔经年
    2020-11-28 03:48

    Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.
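
    If Ray is not already installed, pip install ray will pull it in.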

    Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).

    import ray
    import sys
    
    # Start Ray; by default it uses all of the cores on the machine.
    ray.init()
    
    @ray.remote
    def fetch(url):
        # Use urllib.request on Python 3 and urllib2 on Python 2.
        if sys.version_info >= (3, 0):
            import urllib.request
            return urllib.request.urlopen(url).read()
        else:
            import urllib2
            return urllib2.urlopen(url).read()
    
    urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
            'https://en.wikipedia.org/wiki/Barack_Obama',
            'https://en.wikipedia.org/wiki/George_W._Bush',
            'https://en.wikipedia.org/wiki/Bill_Clinton',
            'https://en.wikipedia.org/wiki/George_H._W._Bush']
    
    # Fetch the webpages in parallel. Each fetch.remote(url) call returns
    # immediately; ray.get blocks until every task has finished and returns
    # the pages in the same order as urls.
    results = ray.get([fetch.remote(url) for url in urls])
    

    If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose the two, as in the example below (the in-fetch option is sketched after it).

    @ray.remote
    def process(html):
        tokens = html.split()
        return set(tokens)
    
    # Fetch and process the pages in parallel. The object ref returned by
    # fetch.remote is passed straight to process.remote, so the HTML flows
    # between the two tasks without coming back to the driver.
    results = []
    for url in urls:
        results.append(process.remote(fetch.remote(url)))
    results = ray.get(results)
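
    The first option mentioned above, folding the processing into fetch itself, would look roughly like this (a sketch; fetch_and_process is an illustrative name, not from the original answer):

    @ray.remote
    def fetch_and_process(url):
        # Same fetch logic as above, with the tokenizing folded in.
        if sys.version_info >= (3, 0):
            import urllib.request
            html = urllib.request.urlopen(url).read()
        else:
            import urllib2
            html = urllib2.urlopen(url).read()
        return set(html.split())

    results = ray.get([fetch_and_process.remote(url) for url in urls])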
    

    If you have a very long list of URLs to fetch, you may want to keep only a limited number of requests in flight at a time and handle the results in the order they complete. You can do this using ray.wait.

    urls = 100 * urls  # Pretend we have a long list of URLs.
    results = []
    
    in_progress_ids = []
    
    # Start pulling 10 URLs in parallel.
    for _ in range(10):
        url = urls.pop()
        in_progress_ids.append(fetch.remote(url))
    
    # Whenever one finishes, start fetching a new one.
    while len(in_progress_ids) > 0:
        # Wait for one task to finish. ray.wait returns the refs of finished
        # tasks (one by default) and the refs of tasks still in flight.
        [ready_id], in_progress_ids = ray.wait(in_progress_ids)
        results.append(ray.get(ready_id))
        # Start a new task.
        if len(urls) > 0:
            in_progress_ids.append(fetch.remote(urls.pop()))
    

    For more details, see the Ray documentation at https://docs.ray.io.
