Calling scrapy from a python script not creating JSON output file

浪尽此生 posted on 2019-11-30 15:58:56

This code worked for me:

# Note: this uses the legacy Scrapy API (scrapy.conf, settings.overrides) on Python 2.
from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
# import your spider here

def handleSpiderIdle(spider):
    # stop the Twisted reactor once the spider has nothing left to do
    reactor.stop()

# ITEM_PIPELINES takes a list of pipeline class paths, not a bare string
mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': ['<name of your project>.pipelines.scrapermar11Pipeline']}

settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="") # create a spider ourselves
crawlerProcess.crawl(spider) # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)

log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
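
For readers on a current Scrapy release, the same idea might be sketched as follows. This is a minimal sketch, assuming Scrapy 2.1 or later (where the FEEDS setting is available) on Python 3. The names build_settings and run_spider and the items.json path are hypothetical and only for illustration; the spider class is still yours to import.

```python
def build_settings(output_path):
    """Return Scrapy settings that export scraped items to a JSON file."""
    return {
        "LOG_ENABLED": True,
        # FEEDS maps an output URI to its export options (assumes Scrapy 2.1+).
        "FEEDS": {output_path: {"format": "json"}},
    }


def run_spider(spider_cls, output_path="items.json"):
    """Run one spider in-process and block until the crawl finishes."""
    # Imported lazily so build_settings() can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=build_settings(output_path))
    process.crawl(spider_cls)  # pass the spider class, not an instance
    process.start()  # runs the Twisted reactor until the crawl is done
```

Calling run_spider(YourSpider) from your own script should then write the scraped items to items.json. Because the reactor cannot be restarted, this is meant to be called once per process.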

A solution that worked for me was to ditch the run script and the internal API altogether, and instead use the command line with GNU Parallel to parallelize the crawls.

To run all known spiders, one per core:

scrapy list | parallel --line-buffer scrapy crawl

scrapy list prints all spiders, one per line, which allows us to pipe them into GNU Parallel, where each name is appended as an argument to the command it is given (scrapy crawl). --line-buffer means that output from the processes is printed to stdout interleaved, but line by line, rather than with quarter or half lines garbled together (for other options, look at --group and --ungroup).

NB: this works best on machines with multiple CPU cores because, by default, GNU Parallel runs one job per core. Note that, unlike many modern development machines, the cheap AWS EC2 and DigitalOcean tiers have only one virtual CPU core. Therefore, if you wish to run jobs simultaneously on one core, you will have to play with the --jobs argument to GNU Parallel, e.g. to run 2 scrapy crawlers per core:

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
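
If GNU Parallel is not available, roughly the same behaviour can be approximated from Python by launching one scrapy crawl subprocess per spider. This is a rough sketch, not a replacement for GNU Parallel: crawl_command, job_count, list_spiders and crawl_all are hypothetical helper names, and rounding the percentage down with a minimum of one job is an assumption of this sketch rather than documented GNU Parallel behaviour. Unlike --line-buffer, the child processes here write straight to the terminal, so their output may be mixed mid-line.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def crawl_command(spider_name):
    """Build the argument list for crawling a single spider."""
    return ["scrapy", "crawl", spider_name]


def job_count(percent, cores=None):
    """Translate a --jobs style percentage into a number of workers."""
    cores = cores or os.cpu_count() or 1
    # Assumption of this sketch: round down, but always run at least one job.
    return max(1, cores * percent // 100)


def list_spiders():
    """Return the spider names printed by `scrapy list`, one per line."""
    result = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def run_crawl(spider_name):
    """Run one crawl to completion and return its exit code."""
    return subprocess.run(crawl_command(spider_name)).returncode


def crawl_all(percent=100):
    """Run `scrapy crawl` for every spider and return the exit codes."""
    spiders = list_spiders()
    # Threads are enough here: each one only waits on its own subprocess.
    with ThreadPoolExecutor(max_workers=job_count(percent)) as pool:
        return list(pool.map(run_crawl, spiders))
```

Calling crawl_all(200) from inside the Scrapy project directory corresponds to the --jobs 200% invocation above.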