How to schedule Scrapy crawl execution programmatically

Asked 2020-12-10 09:24

I want to create a scheduler script to run the same spider multiple times in a sequence.

So far I got the following:

#!/usr/bin/python3
"""Scheduler script to run the same spider multiple times in a sequence."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider

while True:
    process = CrawlerProcess(get_project_settings())
    process.crawl(DealsSpider)
    process.start()  # raises ReactorNotRestartable on the second iteration
    time.sleep(30)

1 Answer
  • 2020-12-10 10:12

    You're getting the ReactorNotRestartable error because Twisted's reactor cannot be restarted once it has been stopped: each time process.start() is called, it tries to start the reactor again. There's plenty of information around the web about this. Here's a simple solution:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    
    from my_project.spiders.deals import DealsSpider
    
    
    def crawl_job():
        """
        Job to start spiders.
        Return Deferred, which will execute after crawl has completed.
        """
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        return runner.crawl(DealsSpider)
    
    def schedule_next_crawl(null, sleep_time):
        """
        Schedule the next crawl
        """
        reactor.callLater(sleep_time, crawl)
    
    def crawl():
        """
        A "recursive" function that schedules a crawl 30 seconds after
        each successful crawl.
        """
        # crawl_job() returns a Deferred
        d = crawl_job()
        # call schedule_next_crawl(<scrapy response>, n) after crawl job is complete
        d.addCallback(schedule_next_crawl, 30)
        d.addErrback(catch_error)
    
    def catch_error(failure):
        print(failure.value)
    
    if __name__=="__main__":
        crawl()
        reactor.run()
    

    There are a few notable differences from your snippet: the reactor is driven directly, CrawlerRunner replaces CrawlerProcess, time.sleep has been removed so the reactor doesn't block, and the while loop has been replaced by repeatedly re-scheduling the crawl function via callLater. It's short and should do what you want. If any part confuses you, let me know and I'll elaborate.
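    The self-rescheduling pattern itself isn't Twisted-specific. As a rough standard-library analogue (a sketch using Python's sched module, not part of the original answer; crawl_job is replaced by a stub and the 30-second delay is shortened so the demo finishes quickly), the same "run, then schedule the next run" shape looks like this:

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
runs = []

def crawl_job():
    # Stand-in for the real crawl; just record that it ran.
    runs.append(time.time())

def crawl(remaining):
    """Run the job, then schedule the next run, like reactor.callLater."""
    crawl_job()
    if remaining > 1:
        # Re-schedule this same function after a short delay
        # (0.01s here instead of 30s, purely to keep the demo fast).
        scheduler.enter(0.01, 1, crawl, argument=(remaining - 1,))

crawl(3)
scheduler.run()  # blocks until no events remain, like reactor.run()
print(len(runs))  # → 3
```

    As in the Twisted version, nothing sleeps inside the job itself; each run ends by handing the next run to the event loop.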

    UPDATE - Crawl at a specific time

    import datetime as dt
    
    def schedule_next_crawl(null, hour, minute):
        tomorrow = (
            dt.datetime.now() + dt.timedelta(days=1)
            ).replace(hour=hour, minute=minute, second=0, microsecond=0)
        sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
        reactor.callLater(sleep_time, crawl)
    
    def crawl():
        d = crawl_job()
        # crawl every day at 1:30pm
        d.addCallback(schedule_next_crawl, hour=13, minute=30)
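
    To sanity-check the arithmetic, here is a stand-alone version of the same sleep-time computation (pure stdlib, no reactor; the helper name is my own, not from the answer). Because it always targets *tomorrow*, the first run has to be kicked off immediately, which is what calling crawl() at startup does:

```python
import datetime as dt

def seconds_until_tomorrow(hour, minute, now=None):
    """Seconds from `now` until tomorrow at hour:minute,
    mirroring schedule_next_crawl's computation."""
    now = now or dt.datetime.now()
    tomorrow = (now + dt.timedelta(days=1)).replace(
        hour=hour, minute=minute, second=0, microsecond=0)
    return (tomorrow - now).total_seconds()

# Example: it is 13:00 on 2020-12-10; the next run is 13:30 tomorrow,
# i.e. 24 hours 30 minutes away.
now = dt.datetime(2020, 12, 10, 13, 0, 0)
print(seconds_until_tomorrow(13, 30, now))  # → 88200.0
```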
    