I have a script that fetches several web pages and parses the info.
(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )
Fetching web pages obviously takes a while, since you're going over the network rather than reading anything local. If you have several to fetch, you could use the threading module to run a few at once.
Here's a very crude example:
import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']

data1 = []
data2 = []

# One thread per URL: run() fetches the page and appends the body
# to the shared list it was given.
class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)

    def run(self):
        page = urllib2.urlopen(self.url)
        self.datadump.append(page.read()) # don't do it like this.

print "Starting threaded reads:"
start = time.clock() # NB: wall time on Windows, CPU time on Unix
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls): pass # don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i, x in enumerate(data1):
    print len(data1[i]), len(data2[i])
This was the output when I ran it:
Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483
Grabbing the data from the threads by appending to a shared list is probably ill-advised (a Queue would be better, since it handles the locking for you), but it illustrates that there is a difference. A rough sketch of the Queue version is below.
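For completeness, here's a minimal sketch of what the Queue-based version might look like, assuming Python 2 (the module is named queue in Python 3). The fetch helper is illustrative rather than part of the original script: each worker puts its result on a Queue.Queue, and join() replaces the busy-wait.

import Queue
import threading
import urllib2

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html']

def fetch(url, result_queue):
    # Queue.Queue is thread-safe, so workers can put() without a lock.
    result_queue.put((url, urllib2.urlopen(url).read()))

result_queue = Queue.Queue()
workers = [threading.Thread(target=fetch, args=(url, result_queue))
           for url in urls]
for w in workers:
    w.start()
for w in workers:
    w.join() # block until this thread finishes; no busy-wait

while not result_queue.empty():
    url, body = result_queue.get()
    print url, len(body)

Once the join() loop finishes, every thread is done, so the final drain of the queue sees all the results and there's no need to poll the length of a shared list.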