Running Multiple Scrapy Spiders (the easy way)

清歌不尽 2020-12-28 09:11

Scrapy is pretty cool, however I found the documentation to be very bare bones, and some simple questions were tough to answer. After putting together various techniques from v…

3 Answers
  • 2020-12-28 09:52

    Yes, there is an excellent companion to Scrapy called scrapyd that does exactly what you are looking for, among many other goodies. You can also launch spiders through it, like this:

    $ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
    {"status": "ok", "jobid": "26d1b1a6d6f111e0be5c001e648c57f8"}
    

    You can add your own custom parameters as well, using -d param=123.

    By the way, spiders are scheduled rather than launched immediately, because scrapyd manages a queue with a (configurable) maximum number of spiders running in parallel.
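
    For example, a minimal sketch (assuming scrapyd is running on localhost:6800 and the project is deployed as myproject) that schedules several spiders in one go from Python instead of curl:

    import requests

    # schedule.json takes the same fields as the curl call above; any extra
    # field (here "param") is forwarded to the spider as a spider argument.
    for spider in ("spider1", "spider2"):
        resp = requests.post(
            "http://localhost:6800/schedule.json",
            data={"project": "myproject", "spider": spider, "param": "123"},
        )
        print(spider, resp.json())  # e.g. {"status": "ok", "jobid": "..."}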

  • 2020-12-28 09:55

    Your method runs the spiders procedurally, which makes it slow and goes against Scrapy's main principle. To keep everything asynchronous as usual, you can try using CrawlerProcess:

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    from myproject.spiders import spider1, spider2
    
    process = CrawlerProcess(get_project_settings())
    # Pass the spider classes (Spider1/Spider2 are placeholder class names);
    # CrawlerProcess instantiates them and runs both crawls concurrently.
    process.crawl(spider1.Spider1)
    process.crawl(spider2.Spider2)
    process.start()
    

    If you want to see the full log of the crawl, set LOG_FILE in your settings.py.

    LOG_FILE = "logs/mylog.log"
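
    If you need more control over the Twisted reactor (for example, because your script does other asynchronous work), a roughly equivalent sketch using CrawlerRunner, assuming placeholder spider classes Spider1 and Spider2, looks like this:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    
    from myproject.spiders.spider1 import Spider1  # placeholder imports
    from myproject.spiders.spider2 import Spider2
    
    configure_logging()
    runner = CrawlerRunner(get_project_settings())
    runner.crawl(Spider1)  # both crawls run concurrently in the same reactor
    runner.crawl(Spider2)
    d = runner.join()      # Deferred that fires once every crawl has finished
    d.addBoth(lambda _: reactor.stop())
    reactor.run()          # blocks here until the crawls are done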
    
  • 2020-12-28 10:01

    Here is the easy way. You need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    setting = get_project_settings()
    process = CrawlerProcess(setting)
    
    # spider_loader knows every spider registered in the project
    for spider_name in process.spider_loader.list():
        print("Running spider %s" % spider_name)
        # "query" is a custom argument used by your spiders
        process.crawl(spider_name, query="dvh")
    
    process.start()
    

    Then run it. That's it!
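
    For completeness, a minimal sketch of the spider side (MySpider is a hypothetical name): keyword arguments passed to process.crawl() end up in the spider's __init__, just like scrapy crawl -a query=dvh:

    import scrapy
    
    class MySpider(scrapy.Spider):  # hypothetical example spider
        name = "myspider"
    
        def __init__(self, query=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.query = query  # receives "dvh" from process.crawl(..., query="dvh")
    
        def start_requests(self):
            # hypothetical search URL built from the custom argument
            yield scrapy.Request("http://example.com/search?q=%s" % self.query)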
