I have a script that fetches several web pages and parses the info.
(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
Most of the answers focus on fetching multiple pages from different servers at the same time (threading), but not on reusing an already-open HTTP connection, which matters if the OP is making multiple requests to the same server/site.
In urllib2 a separate connection is created for each request, which hurts performance and results in a slower rate of fetching pages. urllib3 solves this problem by using a connection pool (it is also thread-safe); you can read more in the urllib3 documentation.
There is also Requests, an HTTP library that uses urllib3 under the hood.
Combining this with threading should increase the speed of fetching pages.
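As a rough sketch of the connection-pool idea (assuming urllib3 is installed; the URLs are just the docs pages used in the examples below), a single PoolManager can be shared across requests and across threads:
import urllib3

# One PoolManager is thread-safe and keeps connections to each host open,
# so repeated requests to the same site reuse the same TCP connection.
http = urllib3.PoolManager()

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/howto/urllib2.html']

for url in urls:
    r = http.request('GET', url)
    print(r.status, len(r.data))
The equivalent with Requests would be to create one requests.Session() and call session.get(url) in a loop, which reuses the same pooled connections.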
Fetching web pages will obviously take a while since you're not accessing anything local. If you have several to fetch, you could use the threading module to run a couple at once.
Here's a very crude example:
import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']

data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)

    def run(self):
        page = urllib2.urlopen(self.url)
        # Appending to a shared list from a thread works here,
        # but don't do it like this in real code (use a Queue).
        self.datadump.append(page.read())

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls):
    pass  # busy-wait until every thread has appended; don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i, x in enumerate(data1):
    print len(data1[i]), len(data2[i])
This was the output when I ran it:
Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483
Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.
Here is a small Python network benchmark script that helps identify where a single connection is slow:
"""Python network test."""
from socket import create_connection
from time import time
try:
from urllib2 import urlopen
except ImportError:
from urllib.request import urlopen
TIC = time()
create_connection(('216.58.194.174', 80))
print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
create_connection(('google.com', 80))
print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
urlopen('http://216.58.194.174')
print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
urlopen('http://google.com')
print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))
An example of the results with Python 3.6:
Duration socket IP connection (s): 0.02
Duration socket DNS connection (s): 75.51
Duration urlopen IP connection (s): 75.88
Duration urlopen DNS connection (s): 151.42
Python 2.7.13 has very similar results.
In this case, DNS and urlopen slowness are easily identified.
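If DNS is the suspect, a small additional check (just a sketch using the standard library, in the same style as the script above) is to time the name resolution on its own with socket.getaddrinfo:
from socket import getaddrinfo
from time import time

TIC = time()
# getaddrinfo only resolves the name; no TCP connection is opened,
# so this isolates the DNS lookup from the handshake and the HTTP request.
getaddrinfo('google.com', 80)
print('Duration DNS resolution only (s): {:.2f}'.format(time() - TIC))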
Here is an example using Python threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).
from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS = 1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html'] * 10
results = []

class Worker(Thread):
    def run(self):
        # Each worker keeps pulling urls off the shared list until it is empty.
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
for t in threads:
    t.start()
while len(results) < 40:   # 4 urls * 10 = 40 downloads in total
    sleep(0.1)
print time() - start
Note: The times given here are for 40 URLs and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300 ms.
With WORKERS=1 it took 86 seconds to run.
With WORKERS=4 it took 23 seconds to run.
With WORKERS=10 it took 10 seconds to run.
So having 10 threads downloading is 8.6 times as fast as a single thread.
Here is an upgraded version that uses a Queue. There are at least a couple of advantages:
1. The urls are requested in the order that they appear in the list.
2. q.join() can be used to detect when the requests have all completed.
3. The results are kept in the same order as the url list.
from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS = 10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html'] * 10
results = [None] * len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url   # if you want to see what's going on
        results[i] = urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for i, url in enumerate(urls):
    q.put((i, url))

q.join()
print time() - start
Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor:
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
The example from there is pasted here for convenience:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
There's also Executor.map, which I think makes the code even simpler:
https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
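As a rough sketch of the map-based version (reusing load_url and URLS from the example above, so it is illustrative rather than a drop-in replacement), executor.map returns results in the same order as the input URLs:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() yields results in input order; an exception from any URL
    # (such as the made-up domain) is raised when its result is reached.
    for url, data in zip(URLS, executor.map(load_url, URLS, [60] * len(URLS))):
        print('%r page is %d bytes' % (url, len(data)))
Unlike as_completed above, a single failing URL ends this loop, so keep the try/except variant if you need per-URL error handling.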