There are several similar questions that I have already read on Stack Overflow. Unfortunately, I lost the links to all of them, because my browsing history was deleted unexpectedly.
Scrapy is built on top of Twisted, and that framework already has its own way of running multiple processes. There is a nice question about this here. In your approach you are actually trying to marry two incompatible and competing libraries (Scrapy/Twisted + multiprocessing). That is probably not the best idea; you can run into lots of problems with it.
If you would like to run Scrapy spiders in multiple processes, it will be much easier to just use Twisted. You could read the Twisted docs for spawnProcess and the other related calls and use those tools for your goal. For example, here's a quick and dirty implementation that runs two spiders in two processes:
from twisted.internet import defer, protocol, reactor
import os


class SpiderRunnerProtocol(protocol.ProcessProtocol):
    def __init__(self, d, inputt=None):
        self.deferred = d
        self.inputt = inputt  # optional bytes to feed to the child's stdin
        self.output = ""
        self.err = ""

    def connectionMade(self):
        if self.inputt:
            self.transport.write(self.inputt)
        self.transport.closeStdin()

    def outReceived(self, data):
        # data arrives as bytes in Python 3
        self.output += data.decode()

    def errReceived(self, data):
        self.err += data.decode()

    def processEnded(self, reason):
        print(reason.value)
        print(self.err)
        # Fire the Deferred with everything the child wrote to stdout
        self.deferred.callback(self.output)


def run_spider(cmd, *args):
    # Spawn `cmd` as a separate child process and return a Deferred
    # that fires with its stdout once the process ends
    d = defer.Deferred()
    pipe = SpiderRunnerProtocol(d)
    args = [cmd] + list(args)
    env = os.environ.copy()
    proc = reactor.spawnProcess(pipe, cmd, args, env=env)
    print(proc.pid)
    return d


def print_out(result):
    print(result)


d1 = run_spider("scrapy", "crawl", "reddit")
d2 = run_spider("scrapy", "crawl", "dmoz")
# Wait for both spiders before stopping the reactor, otherwise the
# slower one gets killed when the reactor shuts down
dl = defer.DeferredList([d1, d2])
dl.addCallback(print_out)
dl.addCallback(lambda _: reactor.stop())
reactor.run()
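Note that defer.DeferredList is what keeps the reactor alive until both spiders have finished; stopping the reactor as soon as a single Deferred fires would kill whichever child process is still running.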
There's a nice blog post explaining the usage of Twisted subprocesses here.
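If all you need from each child process is its output, Twisted also ships a small helper, twisted.internet.utils.getProcessOutput, which hides the ProcessProtocol boilerplate and just gives you a Deferred that fires with the process output. A minimal sketch along the same lines (the spider names are the same examples as above; errortoo=True folds stderr into the result, which arrives as bytes):

from twisted.internet import defer, reactor
from twisted.internet.utils import getProcessOutput
import os


def crawl(spider_name):
    # Returns a Deferred firing with the child's combined stdout/stderr
    return getProcessOutput("scrapy", ["crawl", spider_name],
                            env=os.environ.copy(), errortoo=True)


results = defer.DeferredList([crawl("reddit"), crawl("dmoz")])
results.addCallback(print)  # a list of (success, output) tuples
results.addCallback(lambda _: reactor.stop())
reactor.run()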