I have a script that fetches several web pages and parses the info.
(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
Most of the answers focus on fetching multiple pages from different servers at the same time (threading), but not on reusing an already-open HTTP connection, which matters if the OP is making multiple requests to the same server/site.
In urllib2 a separate connection is created for each request, which hurts performance and results in a slower rate of fetching pages. urllib3 solves this problem by using a connection pool (it is also thread-safe); you can read more in the urllib3 documentation.
There is also Requests, an HTTP library that uses urllib3 under the hood.
Combining this with threading should increase the speed of fetching pages.
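As a rough sketch of the connection-pool idea (assuming urllib3 is installed; the URLs are just the docs pages used in the examples below), a single PoolManager can be shared across requests and across threads:
import urllib3

# One PoolManager is thread-safe and keeps connections to each host open,
# so repeated requests to the same site reuse the same TCP connection.
http = urllib3.PoolManager()

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/howto/urllib2.html']

for url in urls:
    r = http.request('GET', url)
    print(r.status, len(r.data))
The equivalent with Requests would be to create one requests.Session() and call session.get(url) in a loop, which reuses the same pooled connections.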
Fetching web pages will obviously take a while since you're not accessing anything local. If you have several to fetch, you could use the threading module to run a couple at once.
Here's a very crude example:
import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']

data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)

    def run(self):
        page = urllib2.urlopen(self.url)
        # Appending to a shared list from a thread works here,
        # but don't do it like this in real code (use a Queue).
        self.datadump.append(page.read())

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls):
    pass  # busy-wait until every thread has appended; don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i, x in enumerate(data1):
    print len(data1[i]), len(data2[i])
This was the output when I ran it:
Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483
Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.
Here is a small Python network benchmark script that helps identify where a single connection is slow:
"""Python network test."""
from socket import create_connection
from time import time
try:
from urllib2 import urlopen
except ImportError:
from urllib.request import urlopen
TIC = time()
create_connection(('216.58.194.174', 80))
print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
create_connection(('google.com', 80))
print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
urlopen('http://216.58.194.174')
print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
urlopen('http://google.com')
print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))
An example of the results with Python 3.6:
Duration socket IP connection (s): 0.02
Duration socket DNS connection (s): 75.51
Duration urlopen IP connection (s): 75.88
Duration urlopen DNS connection (s): 151.42
Python 2.7.13 has very similar results.
In this case, DNS and urlopen slowness are easily identified.
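If DNS is the suspect, a small additional check (just a sketch using the standard library, in the same style as the script above) is to time the name resolution on its own with socket.getaddrinfo:
from socket import getaddrinfo
from time import time

TIC = time()
# getaddrinfo only resolves the name; no TCP connection is opened,
# so this isolates the DNS lookup from the handshake and the HTTP request.
getaddrinfo('google.com', 80)
print('Duration DNS resolution only (s): {:.2f}'.format(time() - TIC))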
Here is an example using Python threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).
from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS = 1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html'] * 10
results = []

class Worker(Thread):
    def run(self):
        # Each worker keeps pulling urls off the shared list until it is empty.
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
for t in threads:
    t.start()
while len(results) < 40:   # 4 urls * 10 = 40 downloads in total
    sleep(0.1)
print time() - start
Note: The times given here are for 40 URLs and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300 ms.
With WORKERS=1 it took 86 seconds to run.
With WORKERS=4 it took 23 seconds to run.
With WORKERS=10 it took 10 seconds to run.
So having 10 threads downloading is 8.6 times as fast as a single thread.
Here is an upgraded version that uses a Queue. There are at least a couple of advantages:
1. The urls are requested in the order that they appear in the list.
2. q.join() can be used to detect when the requests have all completed.
3. The results are kept in the same order as the url list.
from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS = 10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html'] * 10
results = [None] * len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url   # if you want to see what's going on
        results[i] = urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for i, url in enumerate(urls):
    q.put((i, url))

q.join()
print time() - start
Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor:
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
The example from there is pasted here for convenience:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
There's also Executor.map, which I think makes the code even simpler:
https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
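As a rough sketch of the map-based version (reusing load_url and URLS from the example above, so it is illustrative rather than a drop-in replacement), executor.map returns results in the same order as the input URLs:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() yields results in input order; an exception from any URL
    # (such as the made-up domain) is raised when its result is reached.
    for url, data in zip(URLS, executor.map(load_url, URLS, [60] * len(URLS))):
        print('%r page is %d bytes' % (url, len(data)))
Unlike as_completed above, a single failing URL ends this loop, so keep the try/except variant if you need per-URL error handling.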