Scrapy - offsite request to be processed based on a regex

Submitted by 我只是一个虾纸丫 on 2019-12-25 08:16:22

Question


I have to crawl 5-6 domains. I want to write the crawler so that an offsite request is processed, rather than filtered out, if its URL contains one of a set of substrings, e.g. [aaa, bbb, ccc]. Should I write a custom middleware, or can I just use a regular expression in allowed_domains?


Answer 1:


The offsite middleware already uses a regex by default, but it is not exposed. It compiles the domains you provide into a regex, and because the domains are escaped, putting regex syntax in allowed_domains will not work.

What you can do, though, is extend that middleware and override its get_host_regex() method to implement your own offsite policy.

The original code in scrapy.spidermiddlewares.offsite.OffsiteMiddleware:

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy"""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('') # allow all by default
    regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(regex)
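To see concretely why regex syntax in allowed_domains fails, here is a standalone sketch of what the default pattern matches (the domain is my own example):

```python
import re

# Mimic the default behaviour: each domain is escaped, so regex
# metacharacters like '.' are treated as literal characters
allowed_domains = ['example.com']
regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
pattern = re.compile(regex)

print(bool(pattern.search('sub.example.com')))  # True: subdomains match
print(bool(pattern.search('exampleXcom')))      # False: the '.' is escaped, not a wildcard
```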

You can override it to return your own regex:

# middlewares.py
import re
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

class MyOffsiteMiddleware(OffsiteMiddleware):
    def get_host_regex(self, spider):
        allowed_regex = getattr(spider, 'allowed_regex', '')
        return re.compile(allowed_regex)

# spiders/myspider.py
class MySpider(scrapy.Spider):
    allowed_regex = r'.+?\.com'

# settings.py
# OffsiteMiddleware is a spider middleware, so register the subclass
# under SPIDER_MIDDLEWARES and disable the built-in one
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.MyOffsiteMiddleware': 500,
}
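For the substring policy from the question, a minimal sketch of the regex you could return from get_host_regex() (the helper name and the allowed_substrings attribute are my own invention, not part of Scrapy):

```python
import re

def substring_host_regex(substrings):
    """Match any host that contains one of the (escaped) substrings."""
    if not substrings:
        return re.compile('')  # empty pattern matches everything
    return re.compile('|'.join(re.escape(s) for s in substrings))

pattern = substring_host_regex(['aaa', 'bbb', 'ccc'])
print(bool(pattern.search('shop.aaa.example')))  # True: contains 'aaa'
print(bool(pattern.search('other.example')))     # False: no substring matches
```

In the spider you would set allowed_substrings = ['aaa', 'bbb', 'ccc'] and have your middleware's get_host_regex() return substring_host_regex(getattr(spider, 'allowed_substrings', [])).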


Source: https://stackoverflow.com/questions/39093211/scrapy-offsite-request-to-be-processed-based-on-a-regex
