Separate output file for every url given in start_urls list of spider in scrapy

Question

I want to create a separate output file for every URL I have set in my spider's start_urls, or otherwise split the output files by start URL.

These are my spider's start_urls:

start_urls = ['http://www.dmoz.org/Arts/', 'http://www.dmoz.org/Business/', 'http://www.dmoz.org/Computers/']

I want to create separate output files like:

Arts.xml
Business.xml
Computers.xml

I don't know exactly how to do this. I am thinking of achieving it by implementing something like the following in the spider_opened method of an item pipeline class:

import re
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class CleanDataPipeline(object):
    def __init__(self):
        self.cnt = 0
        self.filename = ''

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        referer_url = response.request.headers.get('referer', None)
        if referer_url in spider.start_urls:
            catname = re.search(r'/(.*)$', referer_url, re.I)
            self.filename = catname.group(1)

        file = open('output/' + str(self.cnt) + '_' + self.filename + '.xml', 'w+b')
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        #file.close()

    def process_item(self, item, spider):
        self.cnt = self.cnt + 1
        self.spider_closed(spider)
        self.spider_opened(spider)
        self.exporter.export_item(item)
        return item

Here I am trying to look up the referer URL of every scraped item in the start_urls list. If the referer URL is found in start_urls, the file name is created from that referer URL. The problem is how to access the response object inside the spider_opened() method. If I could access it there, I could create the file based on it.

Any help to find a way to perform this? Thanks in advance!

[EDIT]

I solved my problem by changing my pipeline code as follows.

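import os  # needed for os.path.isfile and os.rename below
import re
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter  # scrapy.exporters in Scrapy 1.0 and later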

class CleanDataPipeline(object):
    def __init__(self):
        self.filename = ''
        self.exporters = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider, fileName = 'default.xml'):
        self.filename = fileName
        file = open('output/' + self.filename, 'w+b')
        exporter = XmlItemExporter(file)
        exporter.start_exporting()
        self.exporters[fileName] = exporter

    def spider_closed(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()

    def process_item(self, item, spider):
        fname = 'default'
        catname = re.search(r'http://www.dmoz.org/(.*?)/', str(item['start_url']), re.I)
        if catname:
            fname = catname.group(1)
        self.curFileName = fname + '.xml'

        if self.filename == 'default.xml':
            if os.path.isfile('output/' + self.filename):
                os.rename('output/' + self.filename, 'output/' + self.curFileName)
            exporter = self.exporters['default.xml']
            del self.exporters['default.xml']
            self.exporters[self.curFileName] = exporter
            self.filename = self.curFileName

        if self.filename != self.curFileName and not self.exporters.get(self.curFileName):
            self.spider_opened(spider, self.curFileName)

        self.exporters[self.curFileName].export_item(item)
        return item

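I also implemented make_requests_from_url in the spider to attach the start URL to every request, so that the spider callback can copy response.meta['start_url'] into the item's start_url field.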

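from scrapy.http import Request

def make_requests_from_url(self, url):
    # Tag the request with its start URL so callbacks can read it from response.meta.
    request = Request(url, dont_filter=True)
    request.meta['start_url'] = url
    return request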
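The rename-from-default.xml bookkeeping in the pipeline above could probably be avoided by opening each output file only when the first item for it arrives. The following is a minimal, standard-library-only sketch of that idea, not the code actually used; `LazyFileRegistry` and its methods are hypothetical names. In the pipeline, each opened file would additionally be wrapped in an XmlItemExporter, and finish_exporting() would be called before the file is closed.

```python
import os
import tempfile


class LazyFileRegistry:
    """Open one output file per name on first use; close them all at the end.

    Hypothetical helper for illustration, not part of Scrapy.
    """

    def __init__(self, directory):
        self.directory = directory
        self._files = {}

    def get(self, name):
        # Open the file only the first time this name is requested.
        if name not in self._files:
            path = os.path.join(self.directory, name)
            self._files[name] = open(path, "wb")
        return self._files[name]

    def close_all(self):
        # Close every handle that was opened, then forget them.
        for handle in self._files.values():
            handle.close()
        self._files.clear()


# Usage: two writes to Arts.xml share one handle; Business.xml gets its own.
out_dir = tempfile.mkdtemp()
registry = LazyFileRegistry(out_dir)
registry.get("Arts.xml").write(b"<item/>")
registry.get("Arts.xml").write(b"<item/>")
registry.get("Business.xml").write(b"<item/>")
registry.close_all()
```

With this shape, process_item only needs to compute the file name and ask the registry for it, and spider_closed only needs to finish and close whatever was opened.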

Answer 1:

I'd implement a more explicit approach (not tested):

Configure the list of possible categories in settings.py:
```
CATEGORIES = ['Arts', 'Business', 'Computers']
```

Define your start_urls based on that setting:

start_urls = ['http://www.dmoz.org/%s/' % category for category in settings.CATEGORIES]

Add a category Field to the Item class.

In the spider's parse method, set the category field according to the current response.url, e.g.:

def parse(self, response):
     ...
     item['category'] = next(category for category in settings.CATEGORIES if category in response.url)
     ...

In the pipeline, open exporters for all categories and choose which exporter to use based on item['category']:

def spider_opened(self, spider):
    ...
    self.exporters = {}
    for category in settings.CATEGORIES:
        file = open('output/%s.xml' % category, 'w+b')
        exporter = XmlItemExporter(file)
        exporter.start_exporting()
        self.exporters[category] = exporter

def spider_closed(self, spider):
    for exporter in self.exporters.values():
        exporter.finish_exporting()

def process_item(self, item, spider):
    self.exporters[item['category']].export_item(item)
    return item

You would probably need to tweak it a bit to make it work, but I hope you get the idea: store the category inside the item being processed, then choose the file to export to based on the item's category value.
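One tweak that may be worth making: the `category in response.url` test is a substring match, so a URL under one category whose path happens to contain another category's name can be assigned to the wrong file, and `next(...)` without a default raises StopIteration when nothing matches. A minimal sketch of a stricter lookup, assuming the category is always the first path segment; `match_category` is a hypothetical helper and the example URLs are illustrative only:

```python
from urllib.parse import urlsplit

CATEGORIES = ["Arts", "Business", "Computers"]


def match_category(url, categories=CATEGORIES, default=None):
    """Return the category named by the first path segment of url, or default."""
    segments = [part for part in urlsplit(url).path.split("/") if part]
    first = segments[0] if segments else None
    return first if first in categories else default


print(match_category("http://www.dmoz.org/Arts/Movies/"))                      # Arts
print(match_category("http://www.dmoz.org/Business/Arts_and_Entertainment/"))  # Business
print(match_category("http://www.dmoz.org/Science/"))                          # None
```

Items whose category comes back as the default could then be dropped or routed to a catch-all file, depending on what the crawl needs.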

Hope that helps.

Answer 2:

Unless you store it in the item itself, you can't really know the starting URL. The following solution should work for you:

Redefine make_requests_from_url to send the starting URL with each Request you make. You can store it in the meta attribute of your Request. Pass this starting URL along with each subsequent Request.
As soon as you decide to pass the item to the pipeline, fill in the item's starting URL from response.meta['start_url'].
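The two steps above can be sketched roughly as follows. This is a standard-library-only illustration using plain dicts in place of Scrapy's request and response meta; `follow_meta` is a hypothetical helper, and the commented lines show how it might be used inside a real callback.

```python
def follow_meta(response_meta, keys=("start_url",)):
    """Build the meta dict for a follow-up request, keeping only the selected keys."""
    return {key: response_meta[key] for key in keys if key in response_meta}


# In a Scrapy callback this might look like (not executed here):
#     yield Request(next_url, callback=self.parse, meta=follow_meta(response.meta))
#     item['start_url'] = response.meta['start_url']

# Simulate a start request followed by two levels of follow-up requests.
first_meta = {"start_url": "http://www.dmoz.org/Arts/", "depth": 0}
child_meta = follow_meta(first_meta)
grandchild_meta = follow_meta(child_meta)
```

Copying only the named keys, rather than reusing the whole meta dict, keeps per-request bookkeeping such as depth from leaking into the follow-up request.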

Hope it helps. The following links may be helpful:

http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.make_requests_from_url

http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=meta#passing-additional-data-to-callback-functions

Source: https://stackoverflow.com/questions/23868784/separate-output-file-for-every-url-given-in-start-urls-list-of-spider-in-scrapy

Tags

python

web-scraping

scrapy

scrapy-spider