How can I speed up fetching pages with urllib2 in python?

野的像风 2020-11-28 03:28

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

11 Answers
  • 2020-11-28 04:04

    Most of the answers focus on fetching multiple pages from different servers at the same time (threading), but not on reusing an already open HTTP connection, which matters if the OP is making multiple requests to the same server/site.

    In urllib2 a separate connection is created for each request, which hurts performance and results in a slower rate of fetching pages. urllib3 solves this problem by using a connection pool (it is also thread-safe); you can read more in the urllib3 documentation.

    There is also Requests, an HTTP library that uses urllib3.

    This, combined with threading, should increase the speed of fetching pages.
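
    As a rough illustration of the connection-reuse idea, here is a minimal sketch using a requests.Session, which pools and reuses connections per host (this assumes the requests package is installed; the URLs are just placeholders):

    import requests

    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/multiprocessing.html']

    # A Session keeps connections to the same host alive and reuses them,
    # so repeated requests skip the TCP (and TLS) setup cost.
    session = requests.Session()

    for url in urls:
        response = session.get(url)
        print('{}: {} bytes'.format(url, len(response.content)))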

  • 2020-11-28 04:05

    Fetching web pages will obviously take a while, since you're not accessing anything local. If you have several to fetch, you could use the threading module to run a few at once.

    Here's a very crude example:

    import threading
    import urllib2
    import time
    
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']
    data1 = []
    data2 = []
    
    class PageFetch(threading.Thread):
        def __init__(self, url, datadump):
            self.url = url
            self.datadump = datadump
            threading.Thread.__init__(self)
        def run(self):
            page = urllib2.urlopen(self.url)
            self.datadump.append(page.read()) # don't do it like this.
    
    print "Starting threaded reads:"
    start = time.clock()
    for url in urls:
        PageFetch(url, data2).start()
    while len(data2) < len(urls): pass # don't do this either.
    print "...took %f seconds" % (time.clock() - start)
    
    print "Starting sequential reads:"
    start = time.clock()
    for url in urls:
        page = urllib2.urlopen(url)
        data1.append(page.read())
    print "...took %f seconds" % (time.clock() - start)
    
    for i,x in enumerate(data1):
        print len(data1[i]), len(data2[i])
    

    This was the output when I ran it:

    Starting threaded reads:
    ...took 2.035579 seconds
    Starting sequential reads:
    ...took 4.307102 seconds
    73127 19923
    19923 59366
    361483 73127
    59366 361483
    

    Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.
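
    For what it's worth, here is a rough sketch of that cleaner approach (Python 2, the same placeholder URLs, no error handling): each thread pushes its result onto a Queue and the main thread simply join()s the workers instead of busy-waiting:

    import threading
    import urllib2
    from Queue import Queue

    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html']

    result_queue = Queue()   # Queue is explicitly documented as thread-safe

    def fetch(url):
        page = urllib2.urlopen(url)
        result_queue.put((url, page.read()))

    threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()             # wait for every fetch to finish instead of polling len()

    while not result_queue.empty():
        url, data = result_queue.get()
        print url, len(data)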

  • 2020-11-28 04:05

    Here is a Python network benchmark script that helps identify where a single connection is slow:

    """Python network test."""
    from socket import create_connection
    from time import time
    
    try:
        from urllib2 import urlopen
    except ImportError:
        from urllib.request import urlopen
    
    TIC = time()
    create_connection(('216.58.194.174', 80))
    print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))
    
    TIC = time()
    create_connection(('google.com', 80))
    print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))
    
    TIC = time()
    urlopen('http://216.58.194.174')
    print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))
    
    TIC = time()
    urlopen('http://google.com')
    print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))
    

    An example of the results with Python 3.6:

    Duration socket IP connection (s): 0.02
    Duration socket DNS connection (s): 75.51
    Duration urlopen IP connection (s): 75.88
    Duration urlopen DNS connection (s): 151.42
    

    Python 2.7.13 has very similar results.

    In this case, DNS and urlopen slowness are easily identified.
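
    If DNS resolution turns out to be the slow part, one possible workaround (a rough sketch, not part of the original benchmark) is to resolve each hostname once and cache the address within the process:

    from socket import gethostbyname
    from time import time

    _dns_cache = {}

    def resolve(host):
        # Look the host up once, then reuse the cached address on later calls.
        if host not in _dns_cache:
            _dns_cache[host] = gethostbyname(host)
        return _dns_cache[host]

    tic = time()
    print('first lookup:  {} ({:.2f}s)'.format(resolve('google.com'), time() - tic))
    tic = time()
    print('cached lookup: {} ({:.2f}s)'.format(resolve('google.com'), time() - tic))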

  • 2020-11-28 04:06

    Here is an example using Python threads. The other threaded answers here launch one thread per URL, which is not very friendly behaviour if it means too many hits for the server to handle (it is common, for example, for spiders to have many URLs on the same host).

    from threading import Thread
    from urllib2 import urlopen
    from time import time, sleep
    
    WORKERS=1
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']*10
    results = []
    
    class Worker(Thread):
        def run(self):
            while urls:
                url = urls.pop()
                results.append((url, urlopen(url).read()))
    
    start = time()
    threads = [Worker() for i in range(WORKERS)]
    any(t.start() for t in threads)
    
    while len(results)<40:
        sleep(0.1)
    print time()-start
    

    Note: the times given here are for 40 URLs and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300 ms.

    With WORKERS=1 it took 86 seconds to run
    With WORKERS=4 it took 23 seconds to run
    With WORKERS=10 it took 10 seconds to run

    So having 10 threads downloading is 8.6 times as fast as a single thread.

    Here is an upgraded version that uses a Queue. There are at least a few advantages:
    1. The URLs are requested in the order that they appear in the list.
    2. q.join() can be used to detect when the requests have all completed.
    3. The results are kept in the same order as the URL list.

    from threading import Thread
    from urllib2 import urlopen
    from time import time, sleep
    from Queue import Queue
    
    WORKERS=10
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']*10
    results = [None]*len(urls)
    
    def worker():
        while True:
            i, url = q.get()
            # print "requesting ", i, url       # if you want to see what's going on
            results[i]=urlopen(url).read()
            q.task_done()
    
    start = time()
    q = Queue()
    for i in range(WORKERS):
        t=Thread(target=worker)
        t.daemon = True
        t.start()
    
    for i,url in enumerate(urls):
        q.put((i,url))
    q.join()
    print time()-start
    
  • 2020-11-28 04:07

    Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor:

    https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example

    The example from there, pasted here for convenience:

    import concurrent.futures
    import urllib.request
    
    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']
    
    # Retrieve a single page and report the url and contents
    def load_url(url, timeout):
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return conn.read()
    
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))
    

    There's also map, which I think makes the code simpler: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
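
    For example, the as_completed loop above could be written roughly like this with map (a sketch based on the documented Executor.map API; error handling is omitted, so a failing URL raises its exception when that result is retrieved):

    import concurrent.futures
    import urllib.request

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://www.bbc.co.uk/']

    def load_url(url, timeout=60):
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return conn.read()

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # map() yields results in the same order as URLS, blocking as needed
        for url, data in zip(URLS, executor.map(load_url, URLS)):
            print('%r page is %d bytes' % (url, len(data)))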
