How to use threading in Scrapy/Twisted, i.e. how to do async calls to blocking code in response callbacks?

Posted by 冷暖自知 on 2020-01-01 15:39:52

Question


I need to run some multithreaded or multiprocessing work in Scrapy (because a library I use makes blocking calls) and, once it completes, pass a Request back to the Scrapy engine.

I need something like this:

def blocking_call(self, html):
    # ....
    # do some work in blocking call
    return Request(url)

def parse(self, response):
    return self.blocking_call(response.body)

How can I do that? I think I should use the Twisted reactor and a Deferred object, but a Scrapy parse callback must return only None, a Request, or a BaseItem object.


Answer 1:


If you want to return a Deferred that fires after your blocking operation has finished running in one of the reactor's thread pool threads, use deferToThreadPool:

from twisted.internet.threads import deferToThreadPool
from twisted.internet import reactor

...

    def parse(self, response):
        return deferToThreadPool(
            reactor, reactor.getThreadPool(), self.blocking_call, response.body)



Answer 2:


Based on the answer from @Jean-Paul Calderone, I did some investigation and testing, and here is what I found.

Internally, Scrapy uses the Twisted framework to manage synchronous and asynchronous request/response calls.

Scrapy issues requests (crawling) asynchronously, but it processes responses (our custom parse callback functions) synchronously. So if a callback makes a blocking call, it blocks the whole engine.

Fortunately, this can be changed. When Twisted processes the result of a Deferred callback, it handles the case where the callback itself returns another Deferred (see the twisted.internet.defer.Deferred source). In that case, Twisted pauses the outer callback chain until the inner Deferred fires, then continues with its result.

Basically, if we return a Deferred from our response callback, the callback changes from synchronous to asynchronous. For that we can use deferToThread, which internally calls deferToThreadPool(reactor, reactor.getThreadPool(), ...), the call used in @Jean-Paul Calderone's code example.

The working code example is:

from scrapy import Request
from twisted.internet.threads import deferToThread

class SpiderWithBlocking(...):
    ...
    def parse(self, response):
        # deferToThread takes the callable first; it looks up the reactor's thread pool itself
        return deferToThread(self.blocking_call, response.body)

    def blocking_call(self, html):
        # ....
        # do some work in blocking call
        return Request(url)

Additionally, only callbacks can return Deferred objects; start_requests cannot (this is how Scrapy works).



Source: https://stackoverflow.com/questions/25842469/how-to-use-threading-in-scrapy-twisted-i-e-how-to-do-async-calls-to-blocking-c
