Multiprocessing of Scrapy Spiders in Parallel Processes

Pawel Miech

Scrapy is built on Twisted, and that framework already has its own way of running multiple processes. There is a nice question about this here. In your approach you are actually trying to marry two incompatible and competing libraries (Scrapy/Twisted + multiprocessing). This is probably not the best idea; you can run into lots of problems with it.

If you would like to run Scrapy spiders in multiple processes, it will be much easier to just use Twisted. You can read the Twisted docs for spawnProcess and the other process-related calls and use those tools for your goal. For example, here's a quick and dirty implementation that runs two spiders in two processes:

from twisted.internet import defer, protocol, reactor
import os


class SpiderRunnerProtocol(protocol.ProcessProtocol):
    def __init__(self, d, inputt=None):
        self.deferred = d
        self.inputt = inputt  # optional bytes to feed to the child's stdin
        self.output = ""
        self.err = ""

    def connectionMade(self):
        # The child process has started: write any stdin payload, then close stdin.
        if self.inputt:
            self.transport.write(self.inputt)
        self.transport.closeStdin()

    def outReceived(self, data):
        # stdout arrives as bytes on Python 3, so decode before accumulating.
        self.output += data.decode()

    def processEnded(self, reason):
        # The child has exited: report the exit reason and any stderr,
        # then fire the deferred with the collected stdout.
        print(reason.value)
        print(self.err)
        self.deferred.callback(self.output)

    def errReceived(self, data):
        # stderr also arrives as bytes; accumulate it separately.
        self.err += data.decode()


def run_spider(cmd, *args, **kwargs):
    d = defer.Deferred()
    pipe = SpiderRunnerProtocol(d)
    # spawnProcess expects the full argv, with the program name as the first element.
    args = [cmd] + list(args)
    env = os.environ.copy()
    x = reactor.spawnProcess(pipe, cmd, args, env=env)
    print(x.pid)
    print(x)
    return d


def print_out(result):
    print(result)

d = run_spider("scrapy", "crawl", "reddit")
d = run_spider("scrapy", "crawl", "dmoz")
d.addCallback(print_out)
d.addCallback(lambda _: reactor.stop())
reactor.run()
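
As a side note, the inputt argument of SpiderRunnerProtocol lets you pipe data into the child process's stdin (connectionMade writes it and then closes stdin); the example above never exercises it. A minimal sketch of that path, using wc -c purely as a stand-in command, might look like this:

# Hypothetical stdin usage: spawn `wc -c` and feed it a few bytes.
# `wc` is only a stand-in; any program that reads stdin works the same way.
d = defer.Deferred()
proto = SpiderRunnerProtocol(d, inputt=b"hello from twisted\n")
reactor.spawnProcess(proto, "wc", ["wc", "-c"], env=os.environ.copy())
d.addCallback(print_out)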

There's a nice blog post explaining the usage of Twisted subprocesses here.
