Is there any method to use a separate scrapy pipeline for each spider?

Posted by 廉价感情 on 2019-12-04 09:44:20

Question


I want to fetch web pages under different domains, which means I have to use different spiders with the command "scrapy crawl myspider". However, I have to use different pipeline logic to put the data into the database, since the content of the pages differs. The problem is that every spider has to go through all of the pipelines defined in settings.py. Is there a more elegant way to use separate pipelines for each spider?


Answer 1:


The ITEM_PIPELINES setting is defined globally for all spiders in the project when the engine starts, and it cannot be changed per spider on the fly.
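
For reference, a minimal sketch of what that global setting typically looks like in settings.py (the project and pipeline names are made up for illustration; in current Scrapy versions the value is a dict mapping pipeline paths to their execution order):

# settings.py -- applies to every spider in the project
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,   # hypothetical pipeline
    'myproject.pipelines.MysqlPipeline': 400,   # hypothetical pipeline
}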

Here are some options to consider:

  • Change the code of pipelines. Skip/continue processing items returned by spiders in the process_item method of your pipeline, e.g.:

    def process_item(self, item, spider): 
        if spider.name not in ['spider1', 'spider2']: 
            return item  
    
        # process item
    
  • Change the way you start crawling. Do it from a script and, based on the spider name passed as a parameter, override your ITEM_PIPELINES setting before calling crawler.configure() (see the sketch just below this list).
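
As a rough sketch of that second option, assuming the newer CrawlerProcess API rather than the older crawler.configure() call, and using made-up spider and pipeline names:

# run_spider.py -- hypothetical launcher script
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# map each spider name to the pipelines it should use (illustrative paths)
PIPELINES = {
    'spider1': {'myproject.pipelines.Pipeline1': 300},
    'spider2': {'myproject.pipelines.Pipeline2': 300},
}

spider_name = sys.argv[1]          # e.g. python run_spider.py spider1

settings = get_project_settings()
settings.set('ITEM_PIPELINES', PIPELINES[spider_name])

process = CrawlerProcess(settings)
process.crawl(spider_name)         # spiders can be referenced by name
process.start()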

See also:

  • Scrapy. How to change spider settings after start crawling?
  • Can I use spider-specific settings?
  • Using one Scrapy spider for several websites
  • related answer

Hope that helps.




Answer 2:


A slightly better version of the above is as follows. It is better because it lets you selectively turn pipelines on for different spiders more easily than hard-coding the 'not in ['spider1', 'spider2']' check in the pipeline above.

In your spider class, add:

#start_urls=...
pipelines = ['pipeline1', 'pipeline2'] #allows you to selectively turn on pipelines within spiders
#...

Then in each pipeline, use getattr to read that attribute. Add:

class pipeline1(object):
    def process_item(self, item, spider):
        # skip this pipeline unless the spider opted in; the default empty list
        # keeps spiders without a `pipelines` attribute from raising AttributeError
        if 'pipeline1' not in getattr(spider, 'pipelines', []):
            return item
        # ...keep going as normal



Answer 3:


A more robust solution; I can't remember where I found it, but a Scrapy dev proposed it somewhere. This method lets you have some pipelines run on all spiders simply by not applying the wrapper to them. It also means you don't have to duplicate the logic for checking whether or not to use the pipeline.

Wrapper:

import functools


def check_spider_pipeline(process_item_method):
    """
    This wrapper makes it so pipelines can be turned on and off at a spider level.
    """
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # only run the real process_item if this pipeline class is in the
        # spider's `pipeline` set; otherwise pass the item through untouched
        if self.__class__ in spider.pipeline:
            return process_item_method(self, item, spider)
        else:
            return item

    return wrapper

Usage:

@check_spider_pipeline
def process_item(self, item, spider):
    ........
    ........
    return item

Spider usage:

pipeline = {some.pipeline, some.other.pipeline, ...}  # a set of pipeline *classes*, not path strings
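
To show how the pieces fit together, here is a rough end-to-end sketch of the decorator approach (the module, class, and spider names are invented for illustration; every pipeline still has to be listed in ITEM_PIPELINES as usual, the decorator only short-circuits the ones a spider did not opt into):

# pipelines.py (hypothetical names)
class MongoPipeline(object):
    @check_spider_pipeline            # the wrapper defined above
    def process_item(self, item, spider):
        # ... write the item to MongoDB ...
        return item


class JsonWriterPipeline(object):
    # no decorator: this pipeline runs for every spider
    def process_item(self, item, spider):
        # ... append the item to a JSON lines file ...
        return item


# spiders/example_spider.py (hypothetical)
import scrapy

from myproject.pipelines import MongoPipeline


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    pipeline = {MongoPipeline}        # the pipeline classes this spider opts into

    def parse(self, response):
        yield {'url': response.url}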


Source: https://stackoverflow.com/questions/17381237/is-there-any-method-to-using-seperate-scrapy-pipeline-for-each-spider
