Python requests with multithreading

Backend · Unresolved · 1 answer · 1021 views
無奈伤痛 2020-12-05 06:58

I've been trying to build a scraper with multithreading functionality for the past two days. Somehow I still couldn't manage it. At first I tried the regular multithreading approach w

1 Answer
  •  情深已故
    2020-12-05 07:39

    Install the grequests module, which works with gevent (requests itself is not designed for async):

    pip install grequests
    

    Then change the code to something like this:

    import grequests
    
    class Test:
        def __init__(self):
            self.urls = [
                'http://www.example.com',
                'http://www.google.com', 
                'http://www.yahoo.com',
                'http://www.stackoverflow.com/',
                'http://www.reddit.com/'
            ]
    
        def exception(self, request, exception):
            # Called by grequests.map for any request that fails
            print("Problem: {}: {}".format(request.url, exception))
    
        def fetch(self):
            # 'async' is a reserved word in Python 3.7+, so the method is renamed;
            # size=5 caps the number of requests in flight at once.
            results = grequests.map((grequests.get(u) for u in self.urls),
                                    exception_handler=self.exception, size=5)
            print(results)
    
    test = Test()
    test.fetch()
    

    This is officially recommended by the requests project:

    Blocking Or Non-Blocking?

    With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

    If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
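
    If you would rather stay with plain requests, the standard library's thread pool gives a similar effect. This is not part of the quoted documentation or the answer above, just a minimal sketch of the multithreaded approach the question was aiming for (the URL list and worker count are placeholders):

    import requests
    from concurrent.futures import ThreadPoolExecutor, as_completed

    urls = [
        'http://www.example.com',
        'http://www.google.com',
        'http://www.stackoverflow.com/',
    ]

    def fetch(url):
        # Each call runs in its own worker thread; requests itself still blocks,
        # so the concurrency comes entirely from the thread pool.
        return requests.get(url, timeout=10)

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                response = future.result()
                print(url, response.status_code)
            except Exception as exc:
                print("Problem: {}: {}".format(url, exc))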

    Using grequests gives me a noticeable performance increase with 10 URLs: 0.877s vs. 3.852s with your original method.
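
    Those figures are from my own run; if you want to reproduce the comparison, a rough timing sketch (with a placeholder URL list) could look like this:

    import grequests  # import before requests so gevent patching happens early
    import requests
    import time

    urls = ['http://www.example.com'] * 10  # placeholder: 10 URLs

    # Sequential baseline: one blocking request after another
    start = time.perf_counter()
    for u in urls:
        requests.get(u)
    print("sequential: {:.3f}s".format(time.perf_counter() - start))

    # grequests: all requests dispatched concurrently via gevent
    start = time.perf_counter()
    grequests.map(grequests.get(u) for u in urls)
    print("grequests:  {:.3f}s".format(time.perf_counter() - start))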
