Question
For now I have two spiders, and what I would like to do is:
- Spider 1 goes to url1 and, if url2 appears, calls spider 2 with url2. It also saves the content of url1 by using a pipeline.
- Spider 2 goes to url2 and does something.
Due to the complexity of both spiders I would like to keep them separated.
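To make the goal concrete, here is a rough sketch of what I imagine spider 1's parse doing (the selector and the url names are just placeholders):
def parse(self, response):
    # save url1's content through the normal item pipeline
    yield {"url": response.url, "body": response.body}

    # hypothetical selector for finding url2 on the page
    url2 = response.css("a.next::attr(href)").extract_first()
    if url2:
        # this is the part I don't know how to do:
        # start spider 2 with url2 while this spider keeps running
        ...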
What I have tried using scrapy crawl:
def parse(self, response):
    p = multiprocessing.Process(
        target=self.testfunc())
    p.join()
    p.start()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)
It does load the settings but doesn't crawl:
2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The documentation has an example of launching a crawl from a script, but what I'm trying to do is launch another spider while using the scrapy crawl command.
edit: Full code
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):
        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return


class TestSpider2(scrapy.Spider):
    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return
What I hope to achieve is something like:
- scrapy crawl test1 (for example, when response.status is 200:)
- inside test1, call scrapy crawl test2
Answer 1:
I won't go in depth since this question is really old, but I'll go ahead and drop this snippet from the official Scrapy docs... You are very close!
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
https://doc.scrapy.org/en/latest/topics/practices.html
And then, using callbacks, you can pass items between your spiders to do whatever logic functions you're talking about.
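For example, here is a rough sketch (the start_url argument and the example URL are my own placeholders, not part of the docs snippet) of how keyword arguments given to crawl() reach the spider's __init__, which is one way to hand the discovered url2 to the second spider:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider2(scrapy.Spider):
    name = "spider2"

    # keyword arguments passed to process.crawl() end up here
    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        yield {"visited": response.url}

process = CrawlerProcess()
# "http://example.com/url2" stands in for the url2 found by the first spider
process.crawl(MySpider2, start_url="http://example.com/url2")
process.start()  # blocks until the crawl finishes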
Answer 2:
We should not run a spider from inside a spider. In my understanding, you want to run one spider when another spider finishes, right? If so, use the source code below:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from datascraper.spiders.file1_spd import Spider1ClassName
from datascraper.spiders.file2_spd import Spider2ClassName
from scrapy.utils.project import get_project_settings

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1ClassName)
    yield runner.crawl(Spider2ClassName)
    reactor.stop()

configure_logging()
config = get_project_settings()
runner = CrawlerRunner(settings=config)
crawl()
reactor.run()  # the script will block here until the last crawl call is finished
You can refer to: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
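If the second spider also needs data that the first one scraped (the url2 from the question), one possible variant is to collect items with the item_scraped signal and pass them on as spider arguments. The record_item helper, the "url2" item field and the start_url argument below are my own assumptions, not part of the docs snippet, and Spider2ClassName is assumed to turn start_url into its start_urls:
from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from datascraper.spiders.file1_spd import Spider1ClassName
from datascraper.spiders.file2_spd import Spider2ClassName

found_urls = []

def record_item(item, response, spider):
    # remember every url2 value the first spider scraped (hypothetical field)
    if "url2" in item:
        found_urls.append(item["url2"])

configure_logging()
runner = CrawlerRunner(settings=get_project_settings())

@defer.inlineCallbacks
def crawl():
    crawler = runner.create_crawler(Spider1ClassName)
    crawler.signals.connect(record_item, signal=signals.item_scraped)
    yield runner.crawl(crawler)
    # run the second spider only for the urls the first one found
    for url in found_urls:
        yield runner.crawl(Spider2ClassName, start_url=url)
    reactor.stop()

crawl()
reactor.run()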
Source: https://stackoverflow.com/questions/32176005/is-it-possible-to-run-another-spider-from-scrapy-spider