Getting scrapy project settings when script is outside of root directory

栀梦 2020-12-17 17:02

I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different project

5 Answers
  • 2020-12-17 17:31

    This can happen because you are no longer "inside" a Scrapy project, so it doesn't know how to load the settings with get_project_settings().

    You can also specify the settings as a dictionary, as shown in the example here:

    http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
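
    For example, a minimal sketch of running a spider from a script with inline settings (the spider, the user agent, and the feed settings below are illustrative, not from the question):

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class MySpider(scrapy.Spider):
        name = 'my_spider'                            # placeholder spider, purely illustrative
        start_urls = ['http://quotes.toscrape.com']

        def parse(self, response):
            yield {'title': response.css('title::text').get()}


    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0 (compatible; my-bot)',  # illustrative settings passed as a dict
        'FEEDS': {'items.json': {'format': 'json'}},
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes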

  • 2020-12-17 17:32

    Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.

    TL;DR: Make sure you set the 'SCRAPY_SETTINGS_MODULE' environment variable to the module path of your actual settings.py file. I'm doing this in the __init__() method of Scraper.

    Consider a project with the following structure.

    my_project/
        main.py                 # Where we are running scrapy from
        scraper/
            run_scraper.py               #Call from main goes here
            scrapy.cfg                   # deploy configuration file
            scraper/                     # project's Python module, you'll import your code from here
                __init__.py
                items.py                 # project items definition file
                pipelines.py             # project pipelines file
                settings.py              # project settings file
                spiders/                 # a directory where you'll later put your spiders
                    __init__.py
                    quotes_spider.py     # Contains the QuotesSpider class
    

    Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder (a minimal sketch of it is shown at the end of this answer).

    My main file:

    from scraper.run_scraper import Scraper
    scraper = Scraper()
    scraper.run_spiders()
    

    My run_scraper.py file:

    from scraper.scraper.spiders.quotes_spider import QuotesSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    import os
    
    
    class Scraper:
        def __init__(self):
            settings_file_path = 'scraper.scraper.settings' # The path seen from root, ie. from main.py
            os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
            self.process = CrawlerProcess(get_project_settings())
            self.spider = QuotesSpider # The spider you want to crawl
    
        def run_spiders(self):
            self.process.crawl(self.spider)
            self.process.start()  # the script will block here until the crawling is finished
    

    Also, note that your settings.py might require a look-over, since the module paths need to be given relative to the root folder (my_project, not scraper). So in my case:

    SPIDER_MODULES = ['scraper.scraper.spiders']
    NEWSPIDER_MODULE = 'scraper.scraper.spiders'
    

    And repeat for all the settings variables you have!
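
    For completeness, a minimal quotes_spider.py might look like this (a sketch based on the standard Scrapy tutorial; the start URL and selectors are illustrative):

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com']  # illustrative start URL

        def parse(self, response):
            # yield one item per quote block on the page
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                }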

  • 2020-12-17 17:39

    It should work; can you share your Scrapy log file?

    Edit: your approach will not work because, when you execute the script, Scrapy looks for the default settings in this order:

    1. the SCRAPY_SETTINGS_MODULE environment variable, if you have set it
    2. a scrapy.cfg file in the directory you are executing the script from; if that file points to a valid settings module, those settings are loaded
    3. otherwise it runs with the vanilla settings provided by Scrapy (your case)

    Solution 1: create a scrapy.cfg file inside the directory you run the script from (outside the project folder) and point it at the valid settings.py module, as shown below.
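
    A minimal scrapy.cfg for that might look like this (the module path is illustrative; use the import path of your own project's settings.py):

    [settings]
    default = myproject.settings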

    Solution 2: make your parent directory a package (add an __init__.py), so that an absolute path is not required and you can run your script as a module, e.g.

    python -m cron.project1

    Solution 3: you can also leave each script where it is, inside its project directory where it already works, and create a sh file that runs them one after another:

    • cd to the first project's location (its root directory)
    • python script1.py
    • cd to the second project's location
    • python script2.py

    Now you can execute the spiders via this sh file whenever requested by Django.

  • 2020-12-17 17:44

    I used the os module for this problem. The Python file you are running is in one directory and your Scrapy project is in a different directory. You cannot simply import the spider and run it from this script, because the current working directory does not contain the settings.py file or the scrapy.cfg.

    import os

    To show the current directory you are working in, use the following code:

    print(os.getcwd())

    From here you are going to want to change the current directory:

    os.chdir(r'\path\to\spider\folder')

    Lastly, tell os which command to execute.

    os.system('python scrape_file.py')
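
    Putting it together, a minimal sketch (the directory and file name are placeholders for your own project):

    import os

    project_dir = r'\path\to\spider\folder'  # placeholder: your Scrapy project directory
    original_dir = os.getcwd()               # remember where we started

    try:
        os.chdir(project_dir)                   # switch into the Scrapy project
        os.system('python scrape_file.py')      # run the script; 'scrapy crawl <spider_name>' also works here
    finally:
        os.chdir(original_dir)                  # restore the original working directory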

  • 2020-12-17 17:53

    I have used this code to solve the problem:

    import os
    from scrapy.settings import Settings
    
    settings = Settings()
    
    settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')   
    settings.setmodule(settings_module_path, priority='project')
    
    print(settings.get('BASE_URL'))
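
    These settings can then be passed to a crawler in the usual way; a minimal sketch (QuotesSpider is a placeholder for whichever spider class you import):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings)  # reuse the Settings object built above
    process.crawl(QuotesSpider)         # placeholder: import or define your own spider class
    process.start()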
    