Python urllib2.urlopen() is slow, need a better way to read several urls

谎友^ asked on 2020-11-28 04:48

As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, and then parses them with BeautifulSoup.
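Roughly speaking, the code does something like this (a minimal sketch, not the actual site code; the URLs are placeholders):

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, contemporary with urllib2

    urls = ['http://example.com/a', 'http://example.com/b']  # placeholder URLs

    # Each urlopen() call blocks until the whole response has arrived,
    # so the total time is the sum of all the individual fetch times.
    for url in urls:
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        print(soup.title)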

9 Answers
  •  感情败类, answered 2020-11-28 05:20

    I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

    import threading, urllib2
    import Queue

    urls_to_load = [
        'http://stackoverflow.com/',
        'http://slashdot.org/',
        'http://www.archive.org/',
        'http://www.yahoo.co.jp/',
    ]

    def read_url(url, queue):
        # urlopen() blocks on network I/O; while this thread waits,
        # the GIL is released and the other threads can run.
        data = urllib2.urlopen(url).read()
        print('Fetched %d bytes from %s' % (len(data), url))
        queue.put(data)

    def fetch_parallel():
        result = Queue.Queue()
        threads = [threading.Thread(target=read_url, args=(url, result))
                   for url in urls_to_load]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait until every fetch has finished
        return result

    def fetch_sequential():
        result = Queue.Queue()
        for url in urls_to_load:
            read_url(url, result)
        return result


    Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.
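
    If you want to reproduce the comparison, a timing harness along these lines works (a sketch, assuming the two functions above; absolute numbers depend on your network):

    import time

    def best_of(fn, runs=3):
        # Take the best of several runs to smooth out network noise.
        times = []
        for _ in range(runs):
            start = time.time()
            fn()
            times.append(time.time() - start)
        return min(times)

    print('sequential: %.2fs' % best_of(fetch_sequential))
    print('parallel:   %.2fs' % best_of(fetch_parallel))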

    Also, it is incorrect to say that threads are useless in Python because of the GIL. This is one of the cases where threads are useful in Python, because the threads are blocked on I/O. As my results show, the parallel case is about twice as fast.
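
    For readers on Python 3, the same I/O-bound pattern is usually written with concurrent.futures instead of hand-rolled threads (a sketch; urllib2's urlopen lives in urllib.request there):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    urls_to_load = [
        'http://stackoverflow.com/',
        'http://slashdot.org/',
    ]

    def read_url(url):
        # Each worker thread sleeps in urlopen() waiting on the network,
        # releasing the GIL so the other fetches proceed in parallel.
        return urlopen(url).read()

    with ThreadPoolExecutor(max_workers=len(urls_to_load)) as pool:
        # pool.map() yields results in the same order as urls_to_load.
        for url, data in zip(urls_to_load, pool.map(read_url, urls_to_load)):
            print('Fetched %d bytes from %s' % (len(data), url))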
