Scrapy Deploy Doesn't Match Debug Result

Question

I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:

Go to the homepage, which contains a set of categorylist links that are used to build the second wave of links.
The second-round links are usually the first page of each category. The other pages inside a category follow the same regular expression pattern, wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling while storing the raw HTML in my item object.

I tested these two steps separately using the scrapy parse command, and they both worked.

First, I tried:

scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules

I could see that it built the outlinks successfully. Then I tested one of the built outlinks:

scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules

The rule seems to be correct, and it generates an item with the HTML stored in it.

However, when I tried to link those two steps together using the --depth argument, I saw that it crawled the outlinks but no items were generated.

scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2

Here is the pseudo code:

from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import MyprojectItem  # assumed project layout


class MyprojectSpider(CrawlSpider):
    name = "myproject"  # must match the name passed to --spider
    allowed_domains = ["myproject.com"]
    start_urls = ["http://www.myproject.com/"]

    rules = (
        # Raw strings keep the regex backslashes intact.
        Rule(LinkExtractor(allow=(r'/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=(r'/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        try:
            soup = BeautifulSoup(response.body, "html.parser")
            ...
            my_request1 = Request(url=myurl1)
            yield my_request1
            my_request2 = Request(url=myurl2)
            yield my_request2
        except Exception:
            pass

    def parse_pricing(self, response):
        item = MyprojectItem()
        try:
            item['myurl'] = response.url
            item['myhtml'] = response.body
            item['mystatus'] = 'fetched'
        except Exception:
            item['mystatus'] = 'failed'
        return item
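
As a sanity check on the two allow patterns in the rules above, here is a minimal sketch that runs them against the URLs from the scrapy parse commands using only the standard re module. It assumes the patterns are applied with search semantics against the full URL, which is how LinkExtractor treats allow as far as I know. The function name matching_rules and the labels "category" and "pricing" are made up for this illustration.

```python
import re

# The same two patterns as in the spider's rules.
CATEGORY_RE = re.compile(r"/categorylist/\w+")
PRICING_RE = re.compile(r"/wholesale/\w+/(?:wholesale|request)/\d+")


def matching_rules(url):
    """Return the labels of the patterns found anywhere in the URL."""
    labels = []
    if CATEGORY_RE.search(url):
        labels.append("category")
    if PRICING_RE.search(url):
        labels.append("pricing")
    return labels


# URLs taken from the scrapy parse commands in the question.
for url in (
    "http://www.myproject.com/categorylist/cat_a",
    "http://www.myproject.com/wholesale/cat_a/request/1",
):
    print(url, matching_rules(url))
```

Both URLs match the pattern they are supposed to, so the regular expressions themselves do not look like the cause of the missing items.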

Thanks a lot for any suggestion!

Answer 1:

I had assumed that the new Request objects I built would be run against the rules and then parsed by the corresponding callback function defined in the Rule. However, after reading the documentation for Request, I found that the callback is handled differently:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.

...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...

In other words, even if the URLs I built match the second rule, their responses won't be passed to parse_pricing unless the callback is set explicitly on the Request. Hope this is helpful to other people.

Source: https://stackoverflow.com/questions/25530998/scrapy-deploy-doesnt-match-debug-result

Tags

python

regex

web-scraping

scrapy

web-crawler