Python urllib2.urlopen() is slow, need a better way to read several urls

谎友^ 2020-11-28 04:48

As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

9 Answers
  •  感动是毒
    2020-11-28 05:18

    Not sure why nobody mentions multiprocessing (if anyone knows why this might be a bad idea, let me know):

    import multiprocessing
    from urllib2 import urlopen
    
    URLS = [....]
    
    def get_content(url):
        return urlopen(url).read()
    
    
    pool = multiprocessing.Pool(processes=8)  # play with ``processes`` for best results
    results = pool.map(get_content, URLS) # This line blocks, look at map_async 
                                          # for non-blocking map() call
    pool.close()  # the process pool no longer accepts new tasks
    pool.join()   # join the processes: this blocks until all URLs are processed
    for result in results:
        pass  # do something with each page's contents
    

    There are a few caveats with multiprocessing pools. First, unlike threads, these are completely new Python processes, each with its own interpreter. While they are not subject to the global interpreter lock, this means you are limited in what you can pass across to the new process.

    You cannot pass lambdas and functions that are defined dynamically. The function that is used in the map() call must be defined in your module in a way that allows the other process to import it.
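    For example (a minimal sketch, not part of the original answer; it uses Python 3, but the pickling rule is the same in Python 2): Pool.map() pickles the callable by reference, so a module-level function works while a lambda raises a pickling error:

    ```python
    import multiprocessing

    def square(x):
        # Module-level function: picklable, so Pool.map can send it to workers.
        return x * x

    if __name__ == "__main__":
        with multiprocessing.Pool(processes=2) as pool:
            print(pool.map(square, [1, 2, 3]))  # → [1, 4, 9]
            # pool.map(lambda x: x * x, [1, 2, 3])  # would fail: lambdas can't be pickled
    ```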

    Pool.map() is the most straightforward way to process multiple tasks concurrently, but it doesn't provide a way to pass multiple arguments, so you may need to write wrapper functions, change function signatures, or pass multiple arguments as part of the iterable that is being mapped.
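    As an illustration (the `fetch` worker here is hypothetical, not from the original answer), two common workarounds in Python 3 are freezing the extra arguments with functools.partial, or using Pool.starmap(), which unpacks argument tuples:

    ```python
    import multiprocessing
    from functools import partial

    def fetch(url, timeout):
        # Hypothetical two-argument worker; stands in for a real downloader.
        return (url, timeout)

    if __name__ == "__main__":
        urls = ["http://a.example", "http://b.example"]
        with multiprocessing.Pool(processes=2) as pool:
            # Workaround 1: freeze the extra argument with functools.partial.
            r1 = pool.map(partial(fetch, timeout=10), urls)
            # Workaround 2 (Python 3.3+): starmap unpacks argument tuples.
            r2 = pool.starmap(fetch, [(u, 10) for u in urls])
        print(r1 == r2)  # → True
    ```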

    You cannot have child processes spawn new ones. Only the parent can spawn child processes. This means you have to carefully plan and benchmark (and sometimes write multiple versions of your code) in order to determine what the most effective use of processes would be.

    Drawbacks notwithstanding, I find multiprocessing to be one of the most straightforward ways to do concurrent blocking calls. You can also combine multiprocessing and threads (afaik, but please correct me if I'm wrong), or combine multiprocessing with green threads.
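    One easy way to combine the two models while keeping the same Pool API (a sketch, assuming Python 3): multiprocessing.dummy wraps the threading module behind the Pool interface, which is usually enough for I/O-bound work like urlopen(), since blocking I/O releases the GIL:

    ```python
    # multiprocessing.dummy exposes the same Pool API, backed by threads.
    from multiprocessing.dummy import Pool as ThreadPool

    def get_length(text):
        # Trivial stand-in for an I/O-bound task such as urlopen(url).read().
        return len(text)

    if __name__ == "__main__":
        with ThreadPool(4) as pool:
            print(pool.map(get_length, ["ab", "abcd"]))  # → [2, 4]
    ```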
