Multiprocessing of Scrapy Spiders in Parallel Processes

甜味超标 2020-12-09 13:01

There are several similar questions that I have already read on Stack Overflow. Unfortunately, I lost the links to all of them because my browsing history got deleted unexpectedly.

1 Answer
  • 2020-12-09 13:46

    Scrapy is built on Twisted, and that framework already has its own way of running multiple processes. There is a nice question about this here. In your approach you are actually trying to marry two incompatible and competing libraries (Scrapy/Twisted + multiprocessing). That is probably not the best idea; you can run into lots of problems with it.
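
    If the goal is simply to run several spiders at once, note that Scrapy itself can drive multiple spiders concurrently in a single process through its CrawlerProcess API. A minimal sketch (RedditSpider and DmozSpider stand in for your own spider classes, and myproject is a hypothetical project module):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # hypothetical import path; substitute your own spider classes
    from myproject.spiders import RedditSpider, DmozSpider

    process = CrawlerProcess(get_project_settings())
    process.crawl(RedditSpider)  # schedule both crawls on the shared reactor
    process.crawl(DmozSpider)
    process.start()  # blocks until every scheduled crawl has finished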

    If you would like to run Scrapy spiders in multiple processes, it will be much easier to just use Twisted itself. You could read the Twisted docs for spawnProcess and related calls and apply those tools to your goal. For example, here's a quick-and-dirty implementation that runs two spiders in two processes:

    from twisted.internet import defer, protocol, reactor
    import os


    class SpiderRunnerProtocol(protocol.ProcessProtocol):
        def __init__(self, d, inputt=None):
            self.deferred = d
            self.inputt = inputt  # optional bytes to feed to the child's stdin
            self.output = b""     # stdout/stderr arrive as bytes in Python 3
            self.err = b""

        def connectionMade(self):
            # Write any stdin payload, then close stdin so the child won't block on it
            if self.inputt:
                self.transport.write(self.inputt)
            self.transport.closeStdin()

        def outReceived(self, data):
            self.output += data

        def errReceived(self, data):
            self.err += data

        def processEnded(self, reason):
            print(reason.value)
            print(self.err.decode(errors="replace"))
            # Fire the deferred with everything the spider wrote to stdout
            self.deferred.callback(self.output)


    def run_spider(cmd, *args):
        d = defer.Deferred()
        pipe = SpiderRunnerProtocol(d)
        args = [cmd] + list(args)  # argv[0] must be the command itself
        env = os.environ.copy()
        proc = reactor.spawnProcess(pipe, cmd, args, env=env)
        print(proc.pid)
        return d


    def print_out(result):
        print(result)


    # Wait for *both* spiders before stopping the reactor
    d1 = run_spider("scrapy", "crawl", "reddit")
    d2 = run_spider("scrapy", "crawl", "dmoz")
    dl = defer.DeferredList([d1, d2])
    dl.addCallback(print_out)
    dl.addCallback(lambda _: reactor.stop())
    reactor.run()
    
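    Note that the two deferreds are combined with a DeferredList, so the reactor is stopped only once both spiders have exited; chaining reactor.stop() onto a single spider's deferred would shut the reactor down as soon as that one finished, killing the other mid-crawl.
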

    There's a nice blog post explaining the usage of Twisted subprocesses here.
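
    If you only need each spider's output and don't want to write a protocol class by hand, Twisted also ships a higher-level helper, twisted.internet.utils.getProcessOutput, which wraps the same spawnProcess machinery and returns a deferred. A rough sketch of the same two-spider run on top of it (errortoo=True folds stderr into the result, which matters because Scrapy logs to stderr; the docs recommend passing an absolute path for the executable):

    import os

    from twisted.internet import defer, reactor, utils


    def crawl(spider_name):
        # Deferred that fires with the child's combined stdout/stderr
        return utils.getProcessOutput(
            "scrapy", args=("crawl", spider_name),
            env=dict(os.environ), errortoo=True)


    dl = defer.DeferredList([crawl("reddit"), crawl("dmoz")])
    dl.addCallback(print)
    dl.addCallback(lambda _: reactor.stop())
    reactor.run()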
