Get Scrapy crawler output/results in script file function


Question


I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results in that script file, inside some function. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output into the 'spider_output' function? Is it possible to get the output/results?


Answer 1:


Here is a solution that collects all output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy.signalmanager import dispatcher

# import your spider class here, e.g. (path is illustrative):
# from my_project.spiders.my_spider import MySpider


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        # called once per scraped item (item_passed is a legacy alias of item_scraped)
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())



Answer 2:


AFAIK there is no way to do this directly, since crawl():

Returns a deferred that is fired when the crawling is finished.

And the crawler doesn't store results anywhere other than outputting them to the logger.

However, returning output would conflict with the whole asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here.
You can simply devise a pipeline that saves your items to a file and then read that file in your spider_output function. You will receive your results, since reactor.run() blocks your script until the output file is complete anyway.
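
For illustration, a minimal sketch of such a pipeline might look like the following (the class name, the items.jl path, and the read-back helper are assumptions, not part of the original answer; the pipeline also has to be enabled in ITEM_PIPELINES):

import json

class JsonLinesWriterPipeline:
    def open_spider(self, spider):
        # 'items.jl' is an assumed output path
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line, so the file can be read back line by line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

def spider_output():
    # read the items back once reactor.run() has returned
    with open('items.jl', encoding='utf-8') as f:
        return [json.loads(line) for line in f]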




Answer 3:


My advice is to use the Python subprocess module to run the spider from the script, rather than the method given in the Scrapy docs for running a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs and even statements that you print from inside the spider.

In Python 3, execute the spider with subprocess.run. For example:

import subprocess

process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    # inspect 'process.stderr' to see what went wrong
    error = process.stderr.decode('utf-8')

Setting stdout/stderr to subprocess.PIPE allows the output to be captured, so it is important to set these arguments. Here command should be a sequence or a string (if it's a string, call run with one more parameter: shell=True). For example:

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah'  # with shell=True
# or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah')  # without shell=True

Also, process.stdout will contain the output from the script, but it will be of type bytes; convert it to str with decode('utf-8').
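
Putting the pieces together, a rough sketch might look like this (the --nolog flag and the assumption that the spider prints one JSON object per item to stdout are illustrative choices, not part of the original answer):

import json
import subprocess

# assumes the spider prints one JSON object per item, e.g. print(json.dumps(dict(item)))
command = ['scrapy', 'crawl', 'website', '--nolog', '-a', 'customArg=blahblah']
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

items = []
if process.returncode == 0:
    for line in process.stdout.decode('utf-8').splitlines():
        line = line.strip()
        if line.startswith('{'):
            items.append(json.loads(line))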



Source: https://stackoverflow.com/questions/40237952/get-scrapy-crawler-output-results-in-script-file-function
