Python urllib2.urlopen() is slow, need a better way to read several urls

谎友^ asked on 2020-11-28 04:48

As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, and then parses them with BeautifulSoup.
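Roughly speaking, the code does something like this (a minimal sketch, not the actual site code; the URLs are placeholders):

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, contemporary with urllib2

    urls = ['http://example.com/a', 'http://example.com/b']  # placeholder URLs

    # Each urlopen() call blocks until the whole response has arrived,
    # so the total time is the sum of all the individual fetch times.
    for url in urls:
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        print(soup.title)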

9 Answers
  •  感情败类, answered 2020-11-28 05:20

    I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

    import threading, urllib2
    import Queue

    urls_to_load = [
        'http://stackoverflow.com/',
        'http://slashdot.org/',
        'http://www.archive.org/',
        'http://www.yahoo.co.jp/',
    ]

    def read_url(url, queue):
        # urlopen() blocks on network I/O; while this thread waits,
        # the GIL is released and the other threads can run.
        data = urllib2.urlopen(url).read()
        print('Fetched %d bytes from %s' % (len(data), url))
        queue.put(data)

    def fetch_parallel():
        result = Queue.Queue()
        threads = [threading.Thread(target=read_url, args=(url, result))
                   for url in urls_to_load]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait until every fetch has finished
        return result

    def fetch_sequential():
        result = Queue.Queue()
        for url in urls_to_load:
            read_url(url, result)
        return result


    Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.
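
    If you want to reproduce the comparison, a timing harness along these lines works (a sketch, assuming the two functions above; absolute numbers depend on your network):

    import time

    def best_of(fn, runs=3):
        # Take the best of several runs to smooth out network noise.
        times = []
        for _ in range(runs):
            start = time.time()
            fn()
            times.append(time.time() - start)
        return min(times)

    print('sequential: %.2fs' % best_of(fetch_sequential))
    print('parallel:   %.2fs' % best_of(fetch_parallel))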

    Also, it is incorrect to say that threads are useless in Python because of the GIL. This is one of the cases where threads are useful in Python, because the threads are blocked on I/O. As my results show, the parallel case is about twice as fast.
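
    For readers on Python 3, the same I/O-bound pattern is usually written with concurrent.futures instead of hand-rolled threads (a sketch; urllib2's urlopen lives in urllib.request there):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    urls_to_load = [
        'http://stackoverflow.com/',
        'http://slashdot.org/',
    ]

    def read_url(url):
        # Each worker thread sleeps in urlopen() waiting on the network,
        # releasing the GIL so the other fetches proceed in parallel.
        return urlopen(url).read()

    with ThreadPoolExecutor(max_workers=len(urls_to_load)) as pool:
        # pool.map() yields results in the same order as urls_to_load.
        for url, data in zip(urls_to_load, pool.map(read_url, urls_to_load)):
            print('Fetched %d bytes from %s' % (len(data), url))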
