CrawlSpider seems not to follow rule

Submitted by 柔情痞子 on 2021-02-11 14:32:22

Question


I followed the example in "Recursively Scraping Web Pages With Scrapy", but I seem to have made a mistake somewhere.

Can someone help me find it, please? It's driving me crazy: I want the results from all of the result pages, but instead I only get the results from page 1.

Here's my code:

import scrapy

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from githubScrape.items import GithubscrapeItem


class GithubSpider(CrawlSpider):
    name = "github2"
    allowed_domains = ["github.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[contains(@class, "next_page")]')), callback='parse_items', follow=True),
    )

    def start_requests(self):

        baseURL = 'https://github.com/search?utf8=%E2%9C%93&q=eagle+SYSTEM+extension%3Asch+size%3A'
        for i in range(10000, 20000, +5000):
            url = baseURL+str(i+1)+".."+str(i+5000)+'&type=Code&ref=searchresults'
            print "URL:",url
            yield Request(url, callback=self.parse_items)


    def parse_items(self, response):

        hxs = Selector(response)
        resultParagraphs = hxs.xpath('//div[contains(@id,"code_search_results")]//p[contains(@class, "title")]')

        items = []
        for p in resultParagraphs:
            hrefs = p.xpath('a/@href').extract()
            projectURL = hrefs[0]
            schemeURL = hrefs[1]
            lastIndexedOn = p.xpath('.//span/time/@datetime').extract()

            i = GithubscrapeItem()
            i['counter'] = self.count
            i['projectURL'] = projectURL
            i['schemeURL'] = schemeURL
            i['lastIndexedOn'] = lastIndexedOn
            items.append(i)
        return(items)

Answer 1:


I didn't find your code at the link you posted, but I think the problem is that your rules are never used.

Scrapy starts crawling by calling the start_requests method, but the rules are compiled and applied in the parse method, which your spider never reaches because your requests go directly from start_requests to parse_items.

If you want the rules to be applied to those responses, remove the callback argument from the Request you yield in start_requests.
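To illustrate the dispatch behavior described above, here is a minimal toy model in plain Python. It is hypothetical, not real Scrapy internals (the class and method names besides parse/parse_items are made up), but it captures the relevant contract: the engine invokes request.callback when one is set, and the spider's default parse() otherwise, and in CrawlSpider parse() is exactly where the rules are applied.

```python
# Toy model of request dispatch (hypothetical; not real Scrapy internals).

class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class MiniCrawlSpider:
    def __init__(self):
        self.followed = []  # URLs the "rules" decided to follow

    def parse(self, request):
        # CrawlSpider's default callback: apply the rules, i.e. extract and
        # follow the pagination links. In the real CrawlSpider, the rule's
        # own callback (parse_items) then runs on each followed page.
        self.followed.append(request.url + "&p=2")

    def parse_items(self, request):
        # Item extraction only -- no link following happens here.
        return ["item from " + request.url]

    def dispatch(self, request):
        # What the engine does with each scheduled request:
        # an explicit callback wins over the spider's default parse().
        callback = request.callback or self.parse
        return callback(request)


spider = MiniCrawlSpider()

# As in the question: an explicit callback skips parse(), so rules never fire.
spider.dispatch(Request("https://github.com/search?q=x", callback=spider.parse_items))
print(spider.followed)  # []

# As the answer suggests: no callback, parse() runs, the rules follow links.
spider.dispatch(Request("https://github.com/search?q=x"))
print(spider.followed)  # ['https://github.com/search?q=x&p=2']
```

This is why removing callback=self.parse_items from the yielded Request fixes the crawl: the responses then flow through CrawlSpider's parse, which honors the next_page rule, and the rule's callback still extracts the items from each page.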



Source: https://stackoverflow.com/questions/34340875/crawlspider-seems-not-to-follow-rule
