How many items has been scraped per start_url

六眼飞鱼酱① 提交于 2019-12-05 14:29:27
eLRuLL

challenge accepted!

there isn't something on scrapy that directly supports this, but you could separate it from your spider code with a Spider Middleware:

middlewares.py

from scrapy.http.request import Request

class StartRequestsCountMiddleware(object):

    start_urls = {}

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            self.start_urls[i] = request.url
            request.meta.update(start_request_index=i)
            yield request

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, Request):
                output.meta.update(
                    start_request_index=response.meta['start_request_index'],
                )
            else:
                spider.crawler.stats.inc_value(
                    'start_requests/item_scraped_count/{}'.format(
                        self.start_urls[response.meta['start_request_index']],
                    ),
                )
            yield output

Remember to activate it on settings.py:

SPIDER_MIDDLEWARES = {
    ...
    'myproject.middlewares.StartRequestsCountMiddleware': 200,
}

Now you should be able to see something like this on your spider stats:

'start_requests/item_scraped_count/START_URL1': ITEMCOUNT1,
'start_requests/item_scraped_count/START_URL2': ITEMCOUNT2,
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!