How can I speed up fetching pages with urllib2 in python?

野的像风 2020-11-28 03:28

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

11 answers
  •  陌清茗 (original poster)
    2020-11-28 04:05

    Fetching web pages will obviously take a while, since you're not accessing anything local. If you have several to fetch, you could use the threading module to run a few at once.

    Here's a very crude example:

    import threading
    import urllib2
    import time
    
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']
    data1 = []
    data2 = []
    
    class PageFetch(threading.Thread):
        def __init__(self, url, datadump):
            self.url = url
            self.datadump = datadump
            threading.Thread.__init__(self)
        def run(self):
            page = urllib2.urlopen(self.url)
            self.datadump.append(page.read()) # don't do it like this.
    
    print "Starting threaded reads:"
    start = time.time()  # wall-clock time; time.clock() reports CPU time on Unix
    threads = [PageFetch(url, data2) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every fetch to finish instead of busy-waiting
    print "...took %f seconds" % (time.time() - start)
    
    print "Starting sequential reads:"
    start = time.time()
    for url in urls:
        page = urllib2.urlopen(url)
        data1.append(page.read())
    print "...took %f seconds" % (time.time() - start)
    
    # data2 is filled in completion order, so the two columns may not line up
    for a, b in zip(data1, data2):
        print len(a), len(b)
    

    This was the output when I ran it:

    Starting threaded reads:
    ...took 2.035579 seconds
    Starting sequential reads:
    ...took 4.307102 seconds
    73127 19923
    19923 59366
    361483 73127
    59366 361483
    

    Grabbing the data from the thread by appending to a list is probably ill-advised (Queue would be better) but it illustrates that there is a difference.
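
    For what it's worth, here is a rough sketch of what the Queue version might look like. The names `read_url` and `fetch_all`, and the `fetch` parameter, are made up for illustration and are not part of urllib2. Each worker tags its result with the URL's index, so the pages come back in the original order rather than in completion order. The try/except around the imports is only there so the same sketch also runs on Python 3.

```python
import threading

try:
    import Queue as queue                 # Python 2
    from urllib2 import urlopen
except ImportError:
    import queue                          # Python 3
    from urllib.request import urlopen


def read_url(url):
    """Download one page and return its body (hypothetical helper)."""
    page = urlopen(url)
    try:
        return page.read()
    finally:
        page.close()


def fetch_all(urls, fetch=read_url):
    """Fetch every URL in its own thread; return results in the order of `urls`.

    If a fetch raises, the exception object is stored in that URL's slot
    instead of the page body.
    """
    results = queue.Queue()               # thread-safe hand-off from workers

    def worker(index, url):
        try:
            results.put((index, fetch(url)))
        except Exception as exc:
            results.put((index, exc))

    threads = [threading.Thread(target=worker, args=(i, u))
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # block until all workers finish

    pages = [None] * len(urls)
    for _ in urls:                        # exactly one result per URL
        index, body = results.get()
        pages[index] = body
    return pages
```

    With the `urls` list from the example above, `pages = fetch_all(urls)` should give a list where `pages[i]` belongs to `urls[i]`. This still starts one thread per URL, which is fine for a handful of pages; for a long list you would probably want a fixed number of worker threads pulling URLs from a second queue.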
