How to set different scrapy-settings for different spiders?

Submitted by 随声附和 on 2019-12-18 12:53:17

Question


I want to enable some http-proxy for some spiders, and disable them for other spiders.

Can I do something like this?

# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders:  # how to get the spider name ???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'myproject.middlewares.ProxyMiddleware': 410,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }

If the code above doesn't work, is there any other suggestion?


Answer 1:


You can define your own proxy middleware, something straightforward like this:

from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)

Then define the attribute use_proxy = True in the spiders that you want to have the proxy enabled. Don't forget to disable the default proxy middleware and enable your modified one.
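To see the opt-in mechanism in isolation, here is a standalone sketch of the same idea that runs without Scrapy installed. `Request` is a minimal stand-in for `scrapy.http.Request`, and the middleware simply tags the request's `meta['proxy']` (which is how Scrapy's built-in proxy middleware routes requests) when the spider has opted in:

```python
class Request:
    """Minimal stand-in for scrapy.http.Request (just a url and a meta dict)."""
    def __init__(self, url):
        self.url = url
        self.meta = {}

class ConditionalProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Only route through the proxy when the spider opted in.
        if getattr(spider, 'use_proxy', False):
            request.meta['proxy'] = self.proxy_url

class ProxiedSpider:
    name = 'a1'
    use_proxy = True  # this spider opts in

class PlainSpider:
    name = 'b2'       # no use_proxy attribute -> no proxy

mw = ConditionalProxyMiddleware('http://127.0.0.1:8123')
proxied = Request('http://example.com')
plain = Request('http://example.com')
mw.process_request(proxied, ProxiedSpider())
mw.process_request(plain, PlainSpider())
```

After this runs, only `proxied.meta` carries the `'proxy'` key; `plain.meta` stays empty.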




Answer 2:


A bit late, but since release 1.0.0 Scrapy has a new feature that lets you override settings per spider, like this:

class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }




class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }



Answer 3:


There is a new and easier way to do this.

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

I use Scrapy 1.3.1.




Answer 4:


You can set settings.overrides within the spider.py file. (Note that scrapy.conf.settings and settings.overrides are deprecated and removed in newer Scrapy releases; prefer custom_settings there.) A working example:

from scrapy.conf import settings

settings.overrides['DOWNLOAD_TIMEOUT'] = 300 

For you, something like this should also work

from scrapy.conf import settings

settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}



Answer 5:


Why not use two projects rather than only one?

Let's name these two projects proj1 and proj2. In proj1's settings.py, put these settings:

HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'myproject.middlewares.ProxyMiddleware': 410,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}

In proj2's settings.py, put these settings:

DOWNLOADER_MIDDLEWARES = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}


Source: https://stackoverflow.com/questions/19327406/how-to-set-different-scrapy-settings-for-different-spiders
