I want to create a scheduler script to run the same spider multiple times in a sequence.
So far I got the following:
#!/usr/bin/python3
"""Schedul
You're getting the ReactorNotRestartable error because the Twisted reactor cannot be started again once it has been stopped. Each call to process.start() tries to start the reactor, so the second call in your loop fails. There's plenty of information around the web about this. Here's a simple solution:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from my_project.spiders.deals import DealsSpider


def crawl_job():
    """
    Job to start spiders.
    Return Deferred, which will execute after crawl has completed.
    """
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    return runner.crawl(DealsSpider)


def schedule_next_crawl(null, sleep_time):
    """
    Schedule the next crawl
    """
    reactor.callLater(sleep_time, crawl)


def crawl():
    """
    A "recursive" function that schedules a crawl 30 seconds after
    each successful crawl.
    """
    # crawl_job() returns a Deferred
    d = crawl_job()
    # call schedule_next_crawl(<crawl result>, 30) after the crawl job is complete
    d.addCallback(schedule_next_crawl, 30)
    # log the error if the crawl (or the scheduling) fails
    d.addErrback(catch_error)


def catch_error(failure):
    print(failure.value)


if __name__ == "__main__":
    crawl()
    reactor.run()
There are a few noticeable differences from your snippet. The reactor is run directly, CrawlerProcess has been replaced with CrawlerRunner, time.sleep has been removed so that the reactor doesn't block, and the while loop has been replaced with repeated calls to the crawl function via callLater. It's short and should do what you want. If any parts confuse you, let me know and I'll elaborate.
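If you want to try the self-rescheduling pattern (each run schedules the next one once it has finished) without Scrapy or Twisted installed, here is a rough analogy using only the standard library's sched module. This is not how the reactor works internally; run_repeating and FakeClock are hypothetical names made up for this sketch, and the fake clock is only there so the example finishes instantly instead of really waiting.

```python
import sched


class FakeClock:
    """Hypothetical stand-in for time.time/time.sleep so nothing really waits."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        # "Sleeping" just moves the fake clock forward.
        self.now += seconds


def run_repeating(job, interval, repeats, timefunc, delayfunc):
    """Run job `repeats` times; each run is scheduled `interval`
    seconds after the previous run has finished."""
    scheduler = sched.scheduler(timefunc, delayfunc)
    remaining = [repeats]

    def run_once():
        job()
        remaining[0] -= 1
        if remaining[0] > 0:
            # Same idea as reactor.callLater(sleep_time, crawl) in the answer:
            # the job re-enters itself into the event queue.
            scheduler.enter(interval, 1, run_once)

    scheduler.enter(0, 1, run_once)
    scheduler.run()


clock = FakeClock()
runs = []
run_repeating(lambda: runs.append(clock.time()), 30, 3, clock.time, clock.sleep)
print(runs)
```

The key point it shares with the Twisted version is that there is no while loop: the next run only exists because the previous one queued it.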
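To crawl at a specific time each day instead of after a fixed delay, schedule_next_crawl and crawl can be changed like this:

import datetime as dt


def schedule_next_crawl(null, hour, minute):
    # target time is tomorrow at hour:minute
    tomorrow = (
        dt.datetime.now() + dt.timedelta(days=1)
    ).replace(hour=hour, minute=minute, second=0, microsecond=0)
    sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
    reactor.callLater(sleep_time, crawl)


def crawl():
    d = crawl_job()
    # crawl every day at 1:30pm
    d.addCallback(schedule_next_crawl, hour=13, minute=30)
    d.addErrback(catch_error)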
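One thing to be aware of: schedule_next_crawl above always targets tomorrow, so if a crawl finishes before 13:30 (for example, the very first run started in the morning), that day's slot is skipped. If that matters to you, a possible variant of the delay calculation is sketched below. seconds_until_next is a hypothetical helper name, and it assumes naive local datetimes, so it ignores daylight saving changes.

```python
import datetime as dt


def seconds_until_next(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    if now is None:
        now = dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is right now): use tomorrow's.
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


# Passing `now` explicitly makes the behaviour easy to check by hand.
print(seconds_until_next(13, 30, now=dt.datetime(2024, 1, 1, 12, 0)))
```

You would then call reactor.callLater(seconds_until_next(hour, minute), crawl) inside schedule_next_crawl.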