Running multiple spiders in Scrapy for one website in parallel?

傲寒 2020-12-08 12:36

I want to crawl a website with 2 parts, and my script is not as fast as I need it to be.

Is it possible to launch 2 spiders, one for scraping the first part and the second one for scraping the second part?

3 Answers
  • 2020-12-08 13:05

    I think what you are looking for is something like this:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished
    

    You can read more in the Scrapy docs: https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
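
    For the asker's case of one site with two parts, the two spider definitions could be filled in roughly like this (class names, URLs, and selectors below are placeholders, not something the answer specifies):

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class PartOneSpider(scrapy.Spider):
        name = "part_one"
        start_urls = ["https://example.com/part-one"]  # placeholder URL
    
        def parse(self, response):
            yield {"part": "one", "title": response.css("title::text").get()}
    
    class PartTwoSpider(scrapy.Spider):
        name = "part_two"
        start_urls = ["https://example.com/part-two"]  # placeholder URL
    
        def parse(self, response):
            yield {"part": "two", "title": response.css("title::text").get()}
    
    process = CrawlerProcess()
    process.crawl(PartOneSpider)
    process.crawl(PartTwoSpider)
    process.start()  # both spiders crawl concurrently in the same process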

  • 2020-12-08 13:08

    A better solution (if you have multiple spiders) is to get the spiders dynamically and run them:

    from scrapy import spiderloader
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils import project
    from twisted.internet import reactor
    from twisted.internet.defer import inlineCallbacks
    
    settings = project.get_project_settings()
    runner = CrawlerRunner(settings)
    
    
    @inlineCallbacks
    def crawl():
        spider_loader = spiderloader.SpiderLoader.from_settings(settings)
        spiders = spider_loader.list()
        classes = [spider_loader.load(name) for name in spiders]
        for my_spider in classes:
            # each crawl is yielded, so the spiders run one after another
            yield runner.crawl(my_spider)
        reactor.stop()
    
    crawl()
    reactor.run()
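
    Note that yielding each runner.crawl() call makes the spiders run one after another. Since the question asks for parallel crawling, here is a minimal sketch (not part of the original answer) that starts every crawl first and only stops the reactor once all of them have finished:

    from scrapy import spiderloader
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor
    from twisted.internet.defer import DeferredList
    
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    
    # runner.crawl() returns a Deferred; starting them all before waiting
    # lets every spider run concurrently inside the same reactor
    crawls = [runner.crawl(spider_loader.load(name))
              for name in spider_loader.list()]
    
    # stop the reactor only when every crawl has finished
    DeferredList(crawls).addCallback(lambda _: reactor.stop())
    reactor.run()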
    

    (Second solution): Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like this:

    from scrapy import spiderloader
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings)
    
    for spider_name in spider_loader.list():
        print("Running spider %s" % spider_name)
        process.crawl(spider_name)
    process.start()
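
    Note that CrawlerProcess runs all the crawls in the same Twisted reactor, so every spider added via process.crawl() runs in parallel within a single process once process.start() is called.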
    
  • 2020-12-08 13:18

    Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    
    for spider_name in process.spiders.list():
        print("Running spider %s" % spider_name)
        # query="dvh" is a custom argument that gets passed to each spider
        process.crawl(spider_name, query="dvh")
    
    process.start()
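
    For reference, a minimal sketch (spider name and URL are placeholders, not from the answer) of how a spider can pick up that custom query argument; Scrapy forwards extra keyword arguments from process.crawl() to the spider's __init__:

    import scrapy
    
    class DvhSpider(scrapy.Spider):
        name = "dvh_spider"  # hypothetical spider name
    
        def __init__(self, query=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.query = query  # receives the value passed as query="dvh"
    
        def start_requests(self):
            # hypothetical search URL built from the custom argument
            yield scrapy.Request("https://example.com/search?q=%s" % self.query)
    
        def parse(self, response):
            self.logger.info("Crawled %s with query %s", response.url, self.query)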
    