Running Scrapy multiple times in the same process


Question


I have a list of URLs, and I want to crawl each of them. Please note:

  • Adding this list as start_urls is not the behavior I'm looking for; I want the URLs crawled one by one, in separate crawl sessions.
  • I want to run Scrapy multiple times in the same process.
  • I want to run Scrapy as a script, as covered in Common Practices, and not from the CLI.

The following code is a full, broken, copy-pastable example. It basically tries to loop through a list of URLs and start the crawler on each of them. This is based on the Common Practices documentation.

from urllib.parse import urlparse
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

for url in urls:
    runner.crawl(MySpider, url)
    reactor.run()

The problem with the above is that it hangs after the first URL; the second URL is never crawled and nothing happens after this:

2018-08-13 20:28:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://testphp.vulnweb.com/> (referer: None)
[...]
2018-08-13 20:28:44 [scrapy.core.engine] INFO: Spider closed (finished)

Answer 1:


reactor.run() blocks until the reactor is stopped, so your loop never gets past the first URL. The only way around this is to play by Twisted's rules. One way to do that is to replace your loop with a Twisted-friendly asynchronous loop, like so:

from twisted.internet.defer import inlineCallbacks
...

@inlineCallbacks
def loop_urls(urls):
    for url in urls:
        # crawl() returns a Deferred; yielding it waits for the crawl to finish
        yield runner.crawl(MySpider, url)
    # stop the reactor only after the last URL has been crawled
    reactor.stop()

loop_urls(urls)
reactor.run()

and this magic roughly translates to:

def loop_urls(urls):
    url, *rest = urls
    dfd = runner.crawl(MySpider, url)
    # crawl() returns a deferred to which a callback (or errback) can be attached
    dfd.addCallback(lambda _: loop_urls(rest) if rest else reactor.stop())

loop_urls(urls)
reactor.run()

You could also use that version directly, but it is far from pretty.
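
For completeness, here is a sketch of the asker's script with the inlineCallbacks fix folded in (same MySpider, URL list, and CrawlerRunner setup as in the question; the helper name crawl_sequentially is just illustrative):

from urllib.parse import urlparse

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from scrapy.crawler import CrawlerRunner
from scrapy.spiders import CrawlSpider
from scrapy.utils.log import configure_logging


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/',
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()


@inlineCallbacks
def crawl_sequentially(url_list):
    # Each yield waits for the previous crawl to finish before starting the next one.
    for url in url_list:
        yield runner.crawl(MySpider, url)
    # Stop the reactor only after the last URL has been crawled.
    reactor.stop()


crawl_sequentially(urls)
reactor.run()  # blocks until reactor.stop() is called above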



Source: https://stackoverflow.com/questions/51829409/running-scrapy-multiple-times-in-the-same-process
