Is there any method to use a separate scrapy pipeline for each spider?

Posted by 廉价感情 on 2019-12-04 09:44:20

Question


I want to fetch web pages under different domains, which means I have to use different spiders with the command "scrapy crawl myspider". However, I have to use different pipeline logic to put the data into the database, since the content of the pages differs. The problem is that every spider has to go through all of the pipelines defined in settings.py. Is there a more elegant way to use separate pipelines for each spider?


Answer 1:


The ITEM_PIPELINES setting is defined globally for all spiders in the project when the engine starts, and it cannot be changed per spider on the fly.
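
For reference, a minimal sketch of what that global setting typically looks like in settings.py (the project and pipeline names are made up for illustration; in current Scrapy versions the value is a dict mapping pipeline paths to their execution order):

# settings.py -- applies to every spider in the project
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,   # hypothetical pipeline
    'myproject.pipelines.MysqlPipeline': 400,   # hypothetical pipeline
}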

Here are some options to consider:

  • Change the code of pipelines. Skip/continue processing items returned by spiders in the process_item method of your pipeline, e.g.:

    def process_item(self, item, spider): 
        if spider.name not in ['spider1', 'spider2']: 
            return item  
    
        # process item
    
  • Change the way you start crawling. Do it from a script and, based on the spider name passed as a parameter, override your ITEM_PIPELINES setting before calling crawler.configure() (see the sketch just below this list).
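
As a rough sketch of that second option, assuming the newer CrawlerProcess API rather than the older crawler.configure() call, and using made-up spider and pipeline names:

# run_spider.py -- hypothetical launcher script
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# map each spider name to the pipelines it should use (illustrative paths)
PIPELINES = {
    'spider1': {'myproject.pipelines.Pipeline1': 300},
    'spider2': {'myproject.pipelines.Pipeline2': 300},
}

spider_name = sys.argv[1]          # e.g. python run_spider.py spider1

settings = get_project_settings()
settings.set('ITEM_PIPELINES', PIPELINES[spider_name])

process = CrawlerProcess(settings)
process.crawl(spider_name)         # spiders can be referenced by name
process.start()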

See also:

  • Scrapy. How to change spider settings after start crawling?
  • Can I use spider-specific settings?
  • Using one Scrapy spider for several websites
  • related answer

Hope that helps.




Answer 2:


A slightly better version of the above is as follows. It is better because it lets you selectively turn pipelines on for different spiders more easily than hard-coding the 'not in ['spider1', 'spider2']' check in the pipeline above.

In your spider class, add:

#start_urls=...
pipelines = ['pipeline1', 'pipeline2'] #allows you to selectively turn on pipelines within spiders
#...

Then in each pipeline, use getattr to read that attribute. Add:

class pipeline1(object):
    def process_item(self, item, spider):
        # skip this pipeline unless the spider opted in; the default empty list
        # keeps spiders without a `pipelines` attribute from raising AttributeError
        if 'pipeline1' not in getattr(spider, 'pipelines', []):
            return item
        # ...keep going as normal



Answer 3:


A more robust solution; I can't remember where I found it, but a Scrapy dev proposed it somewhere. This method lets you have some pipelines run on all spiders simply by not applying the wrapper to them. It also means you don't have to duplicate the logic for checking whether or not to use the pipeline.

Wrapper:

import functools


def check_spider_pipeline(process_item_method):
    """
    This wrapper makes it so pipelines can be turned on and off at a spider level.
    """
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # only run the real process_item if this pipeline class is in the
        # spider's `pipeline` set; otherwise pass the item through untouched
        if self.__class__ in spider.pipeline:
            return process_item_method(self, item, spider)
        else:
            return item

    return wrapper

Usage:

@check_spider_pipeline
def process_item(self, item, spider):
    ........
    ........
    return item

Spider usage:

pipeline = {some.pipeline, some.other.pipeline, ...}  # a set of pipeline *classes*, not path strings
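
To show how the pieces fit together, here is a rough end-to-end sketch of the decorator approach (the module, class, and spider names are invented for illustration; every pipeline still has to be listed in ITEM_PIPELINES as usual, the decorator only short-circuits the ones a spider did not opt into):

# pipelines.py (hypothetical names)
class MongoPipeline(object):
    @check_spider_pipeline            # the wrapper defined above
    def process_item(self, item, spider):
        # ... write the item to MongoDB ...
        return item


class JsonWriterPipeline(object):
    # no decorator: this pipeline runs for every spider
    def process_item(self, item, spider):
        # ... append the item to a JSON lines file ...
        return item


# spiders/example_spider.py (hypothetical)
import scrapy

from myproject.pipelines import MongoPipeline


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    pipeline = {MongoPipeline}        # the pipeline classes this spider opts into

    def parse(self, response):
        yield {'url': response.url}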


Source: https://stackoverflow.com/questions/17381237/is-there-any-method-to-using-seperate-scrapy-pipeline-for-each-spider
