Async query database for keys to use in multiple requests

You haven't described a use case where asynchronous database queries are a necessity. I'm assuming you can't begin scraping your URLs until you've queried the database first? If that's the case, you're better off just doing the query synchronously: iterate over the query results, extract what you need, then yield Request objects. It makes little sense to query a db asynchronously and then just sit around waiting for the query to finish.
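
If your workflow really is just "query once, then crawl", a minimal sketch of the synchronous approach could look like this (assuming a sqlite database; the file name, table, and column names are placeholders):

import sqlite3
import scrapy


class SyncQuerySpider(scrapy.Spider):
    name = 'sync_query'

    def start_requests(self):
        # Run the query synchronously before the crawl starts;
        # 'urls.db', 'pages', and 'url' are made up for the example.
        conn = sqlite3.connect('urls.db')
        try:
            for (url,) in conn.execute('SELECT url FROM pages'):
                yield scrapy.Request(url=url, callback=self.parse)
        finally:
            conn.close()

    def parse(self, response):
        self.logger.info('visited %s', response.url)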

Alternatively, you can let the callback of a Deferred object pass the URLs to a generator. The generator then converts each received URL into a scrapy Request object and yields it. Below is an example based on the code you linked (not tested):

import scrapy
from queue import Queue
from pdb import set_trace as st
from twisted.internet.defer import Deferred


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self):
        self.urls = Queue()
        self.stop = False
        self.requests = self.request_generator()
        self.deferred = self.deferred_generator()

    def deferred_generator(self):
        # Yield a Deferred whose callback feeds incoming urls into the queue.
        d = Deferred()
        d.addCallback(self.deferred_callback)
        yield d

    def request_generator(self):
        # Block until a url is available, then wrap it in a Request.
        while not self.stop:
            url = self.urls.get()
            yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        yield next(self.requests)

    def parse(self, response):
        st()  # debugger breakpoint to inspect the response

        # when you need to parse the next url from the callback
        yield next(self.requests)

    def deferred_callback(self, url):
        self.urls.put(url)
        if no_more_urls():  # placeholder: replace with your own end-of-input check
            self.stop = True

Don't forget to stop the request generator when you're done.
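
Note that something still has to supply URLs for deferred_callback to receive. A hedged sketch of that wiring, using Twisted's adbapi (the connection arguments, table, and column are placeholders, and it calls deferred_callback directly, which has the same effect as firing the spider's Deferred):

from twisted.enterprise import adbapi

# Hypothetical glue code: run the query off the reactor thread and
# hand each resulting url to the spider's callback.
dbpool = adbapi.ConnectionPool('sqlite3', 'urls.db', check_same_thread=False)

def feed_spider(rows, spider):
    for (url,) in rows:
        spider.deferred_callback(url)

d = dbpool.runQuery('SELECT url FROM pages')
d.addCallback(feed_spider, spider)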
