How to schedule Scrapy crawl execution programmatically

Asked 2020-12-10 09:24

I want to create a scheduler script to run the same spider multiple times in a sequence.

So far I got the following:

#!/usr/bin/python3
"""Scheduler script to run the same spider multiple times in a sequence."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider

while True:
    process = CrawlerProcess(get_project_settings())
    process.crawl(DealsSpider)
    process.start()  # raises ReactorNotRestartable on the second iteration
    time.sleep(30)

1 Answer
  • 2020-12-10 10:12

    You're getting the ReactorNotRestartable error because Twisted's reactor cannot be restarted once it has been stopped: each time process.start() is called, it tries to start the reactor again. There's plenty of information around the web about this. Here's a simple solution:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    
    from my_project.spiders.deals import DealsSpider
    
    
    def crawl_job():
        """
        Job to start spiders.
        Return Deferred, which will execute after crawl has completed.
        """
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        return runner.crawl(DealsSpider)
    
    def schedule_next_crawl(null, sleep_time):
        """
        Schedule the next crawl
        """
        reactor.callLater(sleep_time, crawl)
    
    def crawl():
        """
        A "recursive" function that schedules a crawl 30 seconds after
        each successful crawl.
        """
        # crawl_job() returns a Deferred
        d = crawl_job()
        # call schedule_next_crawl(<scrapy response>, n) after crawl job is complete
        d.addCallback(schedule_next_crawl, 30)
        d.addErrback(catch_error)
    
    def catch_error(failure):
        print(failure.value)
    
    if __name__=="__main__":
        crawl()
        reactor.run()
    

    There are a few notable differences from your snippet: the reactor is driven directly, CrawlerRunner replaces CrawlerProcess, time.sleep has been removed so the reactor doesn't block, and the while loop has been replaced by repeatedly re-scheduling the crawl function via callLater. It's short and should do what you want. If any part confuses you, let me know and I'll elaborate.
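    The self-rescheduling pattern itself isn't Twisted-specific. As a rough standard-library analogue (a sketch using Python's sched module, not part of the original answer; crawl_job is replaced by a stub and the 30-second delay is shortened so the demo finishes quickly), the same "run, then schedule the next run" shape looks like this:

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
runs = []

def crawl_job():
    # Stand-in for the real crawl; just record that it ran.
    runs.append(time.time())

def crawl(remaining):
    """Run the job, then schedule the next run, like reactor.callLater."""
    crawl_job()
    if remaining > 1:
        # Re-schedule this same function after a short delay
        # (0.01s here instead of 30s, purely to keep the demo fast).
        scheduler.enter(0.01, 1, crawl, argument=(remaining - 1,))

crawl(3)
scheduler.run()  # blocks until no events remain, like reactor.run()
print(len(runs))  # → 3
```

    As in the Twisted version, nothing sleeps inside the job itself; each run ends by handing the next run to the event loop.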

    UPDATE - Crawl at a specific time

    import datetime as dt
    
    def schedule_next_crawl(null, hour, minute):
        tomorrow = (
            dt.datetime.now() + dt.timedelta(days=1)
            ).replace(hour=hour, minute=minute, second=0, microsecond=0)
        sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
        reactor.callLater(sleep_time, crawl)
    
    def crawl():
        d = crawl_job()
        # crawl every day at 1:30pm
        d.addCallback(schedule_next_crawl, hour=13, minute=30)
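
    To sanity-check the arithmetic, here is a stand-alone version of the same sleep-time computation (pure stdlib, no reactor; the helper name is my own, not from the answer). Because it always targets *tomorrow*, the first run has to be kicked off immediately, which is what calling crawl() at startup does:

```python
import datetime as dt

def seconds_until_tomorrow(hour, minute, now=None):
    """Seconds from `now` until tomorrow at hour:minute,
    mirroring schedule_next_crawl's computation."""
    now = now or dt.datetime.now()
    tomorrow = (now + dt.timedelta(days=1)).replace(
        hour=hour, minute=minute, second=0, microsecond=0)
    return (tomorrow - now).total_seconds()

# Example: it is 13:00 on 2020-12-10; the next run is 13:30 tomorrow,
# i.e. 24 hours 30 minutes away.
now = dt.datetime(2020, 12, 10, 13, 0, 0)
print(seconds_until_tomorrow(13, 30, now))  # → 88200.0
```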
    