Getting scrapy project settings when script is outside of root directory

栀梦 2020-12-17 17:02

I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different project

5 Answers
  •  庸人自扰
    2020-12-17 17:32

    Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.

    TLDR: Make sure you set the 'SCRAPY_SETTINGS_MODULE' environment variable to the module path of your actual settings.py file. I'm doing this in the __init__() method of Scraper.
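In case it helps, the variable takes a dotted module path, not a filesystem path. A minimal stdlib-only sketch (the 'scraper.scraper.settings' path matches the layout described next; adjust it to your own project):

```python
import os

# Dotted module path of settings.py as seen from the root folder
# (my_project), NOT a filesystem path like 'scraper/scraper/settings.py'.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scraper.scraper.settings')

print(os.environ['SCRAPY_SETTINGS_MODULE'])
```

Note that setdefault only sets the variable if it isn't already defined, so an externally exported SCRAPY_SETTINGS_MODULE still wins.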

    Consider a project with the following structure.

    my_project/
        main.py                 # Where we are running scrapy from
        scraper/
            run_scraper.py               #Call from main goes here
            scrapy.cfg                   # deploy configuration file
            scraper/                     # project's Python module, you'll import your code from here
                __init__.py
                items.py                 # project items definition file
                pipelines.py             # project pipelines file
                settings.py              # project settings file
                spiders/                 # a directory where you'll later put your spiders
                    __init__.py
                    quotes_spider.py     # Contains the QuotesSpider class
    

    Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.
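One way to see why Scrapy silently falls back to its defaults: the dotted settings path only resolves when Python is started from my_project. A small stdlib check can confirm this before handing the path to Scrapy (find_spec raises ModuleNotFoundError when a parent package is missing, so it's wrapped):

```python
import importlib.util

def settings_module_exists(dotted_path):
    """Return True if the dotted module path is importable from the cwd."""
    try:
        return importlib.util.find_spec(dotted_path) is not None
    except ModuleNotFoundError:
        # A missing parent package (e.g. 'scraper') raises rather
        # than returning None, so treat that as "not importable".
        return False

print(settings_module_exists('os.path'))                  # stdlib: True
# Resolves only when run from my_project/; False elsewhere:
print(settings_module_exists('scraper.scraper.settings'))
```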

    My main file:

    from scraper.run_scraper import Scraper
    scraper = Scraper()
    scraper.run_spiders()
    

    My run_scraper.py file:

    from scraper.scraper.spiders.quotes_spider import QuotesSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    import os
    
    
    class Scraper:
        def __init__(self):
            settings_file_path = 'scraper.scraper.settings' # The path seen from root, ie. from main.py
            os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
            self.process = CrawlerProcess(get_project_settings())
            self.spider = QuotesSpider # The spider you want to crawl
    
        def run_spiders(self):
            self.process.crawl(self.spider)
            self.process.start()  # the script will block here until the crawling is finished
    

    Also note that settings.py itself might require a look-over, since the module paths it contains need to be relative to the root folder (my_project, not scraper). So in my case:

    SPIDER_MODULES = ['scraper.scraper.spiders']
    NEWSPIDER_MODULE = 'scraper.scraper.spiders'
    

    And repeat for all the settings variables you have!
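Putting that together, a hypothetical settings.py with every dotted path rewritten relative to the root might look like this (ScraperPipeline is an assumed pipeline class name, purely for illustration):

```python
# settings.py -- all module paths as seen from my_project, not scraper
BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

# Any other setting holding a dotted path needs the same prefix,
# e.g. pipelines (ScraperPipeline is a hypothetical class name):
ITEM_PIPELINES = {
    'scraper.scraper.pipelines.ScraperPipeline': 300,
}
```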
