How can I speed up fetching pages with urllib2 in Python?

野的像风 2020-11-28 03:28

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
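
The script itself isn't shown here, but a minimal sketch of the kind of serial loop being described might look like this (the URL list and the parse step are placeholders, not the real script):

    import urllib2  # on Python 3 this would be urllib.request

    # Placeholder URL list; the real script has its own.
    urls = ['http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01']

    for url in urls:
        html = urllib2.urlopen(url).read()  # each request blocks until it finishes
        # ... parse html here ...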

11 Answers
  •  执笔经年
    2020-11-28 03:48

    Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.
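
    If Ray is not already installed, pip install ray will pull it in.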

    Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).

    import ray
    import sys
    
    # Start Ray; by default it uses all of the cores on the machine.
    ray.init()
    
    @ray.remote
    def fetch(url):
        # Use urllib.request on Python 3 and urllib2 on Python 2.
        if sys.version_info >= (3, 0):
            import urllib.request
            return urllib.request.urlopen(url).read()
        else:
            import urllib2
            return urllib2.urlopen(url).read()
    
    urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
            'https://en.wikipedia.org/wiki/Barack_Obama',
            'https://en.wikipedia.org/wiki/George_W._Bush',
            'https://en.wikipedia.org/wiki/Bill_Clinton',
            'https://en.wikipedia.org/wiki/George_H._W._Bush']
    
    # Fetch the webpages in parallel. Each fetch.remote(url) call returns
    # immediately; ray.get blocks until every task has finished and returns
    # the pages in the same order as urls.
    results = ray.get([fetch.remote(url) for url in urls])
    

    If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose the two, as in the example below (the in-fetch option is sketched after it).

    @ray.remote
    def process(html):
        tokens = html.split()
        return set(tokens)
    
    # Fetch and process the pages in parallel. The object ref returned by
    # fetch.remote is passed straight to process.remote, so the HTML flows
    # between the two tasks without coming back to the driver.
    results = []
    for url in urls:
        results.append(process.remote(fetch.remote(url)))
    results = ray.get(results)
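
    The first option mentioned above, folding the processing into fetch itself, would look roughly like this (a sketch; fetch_and_process is an illustrative name, not from the original answer):

    @ray.remote
    def fetch_and_process(url):
        # Same fetch logic as above, with the tokenizing folded in.
        if sys.version_info >= (3, 0):
            import urllib.request
            html = urllib.request.urlopen(url).read()
        else:
            import urllib2
            html = urllib2.urlopen(url).read()
        return set(html.split())

    results = ray.get([fetch_and_process.remote(url) for url in urls])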
    

    If you have a very long list of URLs to fetch, you may want to keep only a limited number of requests in flight at a time and handle the results in the order they complete. You can do this using ray.wait.

    urls = 100 * urls  # Pretend we have a long list of URLs.
    results = []
    
    in_progress_ids = []
    
    # Start pulling 10 URLs in parallel.
    for _ in range(10):
        url = urls.pop()
        in_progress_ids.append(fetch.remote(url))
    
    # Whenever one finishes, start fetching a new one.
    while len(in_progress_ids) > 0:
        # Wait for one task to finish. ray.wait returns the refs of finished
        # tasks (one by default) and the refs of tasks still in flight.
        [ready_id], in_progress_ids = ray.wait(in_progress_ids)
        results.append(ray.get(ready_id))
        # Start a new task.
        if len(urls) > 0:
            in_progress_ids.append(fetch.remote(urls.pop()))
    

    For more details, see the Ray documentation at https://docs.ray.io.
