Dynamic rules based on start_urls for Scrapy CrawlSpider?

Question


I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain).

I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites, I run into a problem because I don't know which "start_url" I'm currently on, so I can't change the rules appropriately.

Here's what I came up with so far. It works for one website, but I'm not sure how to apply it to a list of websites:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    homepage = 'http://www.somesite.com'

    start_urls = [homepage]

    # strip http and www
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    rules = (
        # internal links: same domain, follow and log
        Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
        # external links: any other domain, scrape but don't follow
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass

This can probably be done by just passing the start_url as an argument when calling the scraper, but I'm looking for a way to do that programmatically within the scraper itself.
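For reference, that command-line alternative would look roughly like this, a minimal sketch using Scrapy's -a spider arguments (the spider and attribute names are just illustrative, and it still handles only one site per crawl):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


# run with: scrapy crawl single_site -a homepage=http://www.somesite.com
class SingleSiteSpider(CrawlSpider):
    name = 'single_site'

    def __init__(self, homepage=None, *args, **kwargs):
        domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '').rstrip('/')
        self.start_urls = [homepage]
        # rules must exist before CrawlSpider.__init__ compiles them
        self.rules = (
            Rule(LinkExtractor(allow_domains=(domain,)), callback='parse_internal', follow=True),
            Rule(LinkExtractor(deny_domains=(domain,)), callback='parse_external', follow=False),
        )
        super(SingleSiteSpider, self).__init__(*args, **kwargs)

    def parse_internal(self, response):
        pass

    def parse_external(self, response):
        pass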

Any ideas? Thanks!

Simon.


Answer 1:


I found a very similar question and used the second option presented in its accepted answer to develop a workaround for this problem, since it's not supported out-of-the-box in Scrapy.

I created a function that takes a URL as input and creates the rules for it:

def rules_for_url(self, url):

    domain = Tools.get_domain(url)

    rules = (
        # internal links: same domain, follow and log
        Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()), callback='parse_internal', follow=True),
        # external links: any other domain, parse but don't follow
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)), callback='parse_external', follow=False),
    )

    return rules
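(Tools.get_domain is just a small helper that returns the bare domain of a URL, stripping the scheme and a leading www.; something along these lines:)

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse


class Tools:

    @staticmethod
    def get_domain(url):
        # 'http://www.somesite.com/page' -> 'somesite.com'
        netloc = urlparse(url).netloc or url
        if netloc.startswith('www.'):
            netloc = netloc[len('www.'):]
        return netloc.rstrip('/')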

I then override some of CrawlSpider's functions:

  1. I changed _rules into a dictionary where the keys are the different website domains and the values are the rules for that domain (built with rules_for_url). _rules is populated in _compile_rules.

  2. I then made the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.

# these overrides go inside the spider class and rely on:
#   import copy
#   import six
#   from scrapy.http import HtmlResponse

_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()

    # look up the rule set compiled for this response's domain
    domain = Tools.get_domain(response.url)
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response) 
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            # store 'domain;rule_index' in the request meta instead of just the index
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):

    # recover the domain and rule index encoded by _requests_to_follow
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])

    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    # build one compiled rule set per start URL, keyed by its domain
    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

See the original functions in the Scrapy source (scrapy/spiders/crawl.py) for comparison.

Now the spider simply goes over each URL in start_urls, creates a set of rules specific to that URL, and then uses the appropriate rules for each website being crawled.
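With that in place, crawling multiple websites is just a matter of listing them (placeholder URLs):

start_urls = [
    'http://www.website1.com',
    'http://www.website2.com',
    'http://www.website3.com',
]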

Hope this helps anyone who stumbles upon this problem in the future.

Simon.




Answer 2:


Iterate over all the website links in start_urls, populate the allow_domains and deny_domains lists, and then define the Rules.

start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]

allow_domains = []
deny_domains = []

for link in start_urls

    # strip http and www
    domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    allow_domains.extend([domain])
    deny_domains.extend([domain])


rules = (
    Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
    Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
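All of this sits at class level in the spider; roughly, reusing the spider from the question (placeholder URLs, with the parse callbacks left as stubs):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class HomepagesSpider(CrawlSpider):
    name = 'homepages'

    start_urls = ['http://www.website1.com', 'http://www.website2.com']

    allow_domains = []
    deny_domains = []
    for link in start_urls:
        domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
        domain = domain[:-1] if domain[-1] == '/' else domain
        allow_domains.append(domain)
        deny_domains.append(domain)

    rules = (
        Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass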


Source: https://stackoverflow.com/questions/42558174/dynamic-rules-based-on-start-urls-for-scrapy-crawlspider
