Scrapy run multiple spiders from a script

Submitted by 雨燕双飞 on 2021-01-29 15:53:31

Question


Hey, I have the following question:

I have a script from which I want to start Scrapy spiders. I used a solution from another Stack Overflow post to load the project settings so that I don't have to override them manually. So far I am able to start two crawlers from outside the Scrapy project:

from scrapy_bots.update_Database.update_Database.spiders.m import M
from scrapy_bots.update_Database.update_Database.spiders.p import P
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os

class Scraper:
    def __init__(self):
        settings_file_path = 'scrapy_bots.update_Database.update_Database.settings' # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider_m = M # The spider you want to crawl
        self.spider_p = P

    def run_spiders(self):
        self.process.crawl(self.spider_m)
        self.process.crawl(self.spider_p)
        self.process.start()  # the script will block here until the crawling is finished

Linked post: "Getting scrapy project settings when script is outside of root directory"

That piece of code is stored in a file next to the scrapy.cfg file. To run the spiders from another directory, I only had to change some settings in the settings file; I can then run them from that directory by instantiating an object of that class. So far so good (for a better explanation, look at the link above).

Now my problems:

1. I have to run the two spiders sequentially; at the moment they run simultaneously. I already took a look at the Scrapy docs, but the code there doesn't work for me: when I implemented it, I was told there is no function run in reactor.

reactor.run()
reactor.stop()

That's not a surprise, because reactor looks like this when I take a look inside it:

# Copyright (c) Twisted Matrix Laboratories.
# See LICENSE for details.

"""
The reactor is the Twisted event loop within Twisted, the loop which drives
applications using Twisted. The reactor provides APIs for networking,
threading, dispatching events, and more.

The default reactor depends on the platform and will be installed if this
module is imported without another reactor being explicitly installed
beforehand. Regardless of which reactor is installed, importing this module is
the correct way to get a reference to it.

New application code should prefer to pass and accept the reactor as a
parameter where it is needed, rather than relying on being able to import this
module to get a reference.  This simplifies unit testing and may make it easier
to one day support multiple reactors (as a performance enhancement), though
this is not currently possible.

@see: L{IReactorCore<twisted.internet.interfaces.IReactorCore>}
@see: L{IReactorTime<twisted.internet.interfaces.IReactorTime>}
@see: L{IReactorProcess<twisted.internet.interfaces.IReactorProcess>}
@see: L{IReactorTCP<twisted.internet.interfaces.IReactorTCP>}
@see: L{IReactorSSL<twisted.internet.interfaces.IReactorSSL>}
@see: L{IReactorUDP<twisted.internet.interfaces.IReactorUDP>}
@see: L{IReactorMulticast<twisted.internet.interfaces.IReactorMulticast>}
@see: L{IReactorUNIX<twisted.internet.interfaces.IReactorUNIX>}
@see: L{IReactorUNIXDatagram<twisted.internet.interfaces.IReactorUNIXDatagram>}
@see: L{IReactorFDSet<twisted.internet.interfaces.IReactorFDSet>}
@see: L{IReactorThreads<twisted.internet.interfaces.IReactorThreads>}
@see: L{IReactorPluggableResolver<twisted.internet.interfaces.IReactorPluggableResolver>}
"""

from __future__ import division, absolute_import

import sys
del sys.modules['twisted.internet.reactor']
from twisted.internet import default
default.install()
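A note on the listing above, offered as a hedged reading rather than a definitive account: the module deletes its own entry from sys.modules and then calls default.install(), which, as far as I understand Twisted, registers the reactor object under that same module name. At runtime, "from twisted.internet import reactor" would then return that object, which does provide run() and stop(), even though a tool that only reads the file sees no such functions. The sketch below imitates the mechanism with the standard library only; FakeReactor, install and the module name demo_reactor are made up for the illustration and are not part of Twisted.

```python
import sys


class FakeReactor:
    """Hypothetical stand-in for a reactor object (not Twisted's class)."""

    def __init__(self):
        self.running = False

    def run(self):
        self.running = True

    def stop(self):
        self.running = False


def install(module_name, reactor):
    """Register an arbitrary object in sys.modules under a module name."""
    sys.modules[module_name] = reactor


# "demo_reactor" is a made-up module name; no such file exists on disk.
install("demo_reactor", FakeReactor())

# The import system finds the sys.modules entry and binds that object,
# so the name refers to the FakeReactor instance, not to a module file.
import demo_reactor

demo_reactor.run()
print(type(demo_reactor).__name__, demo_reactor.running)  # FakeReactor True
```

If this reading is right, a "no function run" message coming from an editor or linter would be a static-analysis limitation, not a runtime error.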

https://docs.scrapy.org/en/latest/topics/practices.html
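For problem 1, the idea behind the docs page linked above is, roughly, to use CrawlerRunner instead of CrawlerProcess and to wait for each crawl's Deferred before starting the next one. The following is a minimal sketch of that idea under stated assumptions, not a drop-in fix: run_spiders_sequentially is a name invented here, and the sketch does not handle projects that set TWISTED_REACTOR (or Scrapy releases where the asyncio reactor is the default), where the matching reactor may need to be installed before the reactor is imported.

```python
def run_spiders_sequentially(spider_classes, settings=None):
    """Run each spider class to completion before starting the next one.

    Blocks until the last spider has finished, then stops the reactor.
    """
    # Imported here rather than at module level: importing
    # twisted.internet.reactor installs the default reactor as a side effect.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import defer, reactor

    # CrawlerRunner does not set up logging on its own.
    configure_logging(settings)
    runner = CrawlerRunner(settings)

    @defer.inlineCallbacks
    def crawl_all():
        try:
            for spider_cls in spider_classes:
                # runner.crawl() returns a Deferred that fires when the
                # crawl ends; yielding it pauses here until then.
                yield runner.crawl(spider_cls)
        finally:
            # Stop the reactor even if one of the crawls failed.
            reactor.stop()

    # Schedule the chain to start once the reactor is running.
    reactor.callWhenRunning(crawl_all)
    reactor.run()  # blocks until reactor.stop() is called above
```

With the class from the question, this would be called as run_spiders_sequentially([M, P], get_project_settings()). Because it starts and stops the reactor, it can still be called only once per Python process.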

2. I want to run that function multiple times in my script at different points. I have already read a lot of posts on how to handle the Twisted "reactor not restartable" error, but nothing worked for me.

Linked post: "Scrapy - Reactor not Restartable"
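For problem 2, one commonly suggested way around the restart limitation is to give every crawl its own process, so that each run gets a fresh reactor. A minimal sketch using only the standard library follows. It assumes that Scrapy is installed for the same interpreter that runs the script, that project_dir is the directory containing scrapy.cfg, and that the spiders are addressed by their name attribute ("m" and "p" in the usage note are guesses, since the spider code is not shown). The function names are made up for this illustration.

```python
import subprocess
import sys


def build_crawl_command(spider_name):
    """Return the argv list for `python -m scrapy crawl <spider_name>`."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_spider_in_subprocess(spider_name, project_dir):
    """Run one spider in a child process and wait for it to exit.

    project_dir is the directory containing scrapy.cfg, so that the
    scrapy command can locate the project settings.
    """
    completed = subprocess.run(build_crawl_command(spider_name), cwd=project_dir)
    return completed.returncode


def run_spiders(spider_names, project_dir):
    """Run the spiders one after another; return {name: exit code}."""
    return {
        name: run_spider_in_subprocess(name, project_dir)
        for name in spider_names
    }
```

A call such as run_spiders(["m", "p"], project_dir) blocks until both child processes have exited, so the spiders run one after the other, and it can be repeated at any later point in the script because no reactor is ever started in the calling process.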

It would be great if somebody could help me with a detailed explanation, because I'm getting nowhere with all the other posts.

Source: https://stackoverflow.com/questions/61444819/scrapy-run-multiple-spiders-from-a-script
