Scrapy: how to stop a redirect (302)

Submitted by 余生长醉 on 2019-11-27 11:41:44

Question


I'm trying to crawl a URL using Scrapy, but it redirects me to a page that doesn't exist.

Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx>

The problem is that http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx exists, but http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197 doesn't, so the crawler can't find it. I've crawled many other websites and haven't had this problem anywhere else. Is there a way to stop this redirect?

Any help would be much appreciated. Thanks.

Update: This is my spider class

    class Inon_Spider(BaseSpider):
        name = 'Inon'
        allowed_domains = ['www.shop.inonit.in']

        start_urls = ['http://www.shop.inonit.in/Products/Inonit-Gadget-Accessories-Mobile-Covers/-The-Red-Tag/Samsung-Note-2-Dead-Mau/pid-2656465.aspx']

        def parse(self, response):

            item = DealspiderItem()
            hxs = HtmlXPathSelector(response)

            title = hxs.select('//div[@class="aboutproduct"]/div[@class="container9"]/div[@class="ctl_aboutbrand"]/h1/text()').extract()
            price = hxs.select('//span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_spnWebPrice"]/span[@class="offer"]/span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_lblOfferPrice"]/text()').extract()
            prc = price[0].replace("Rs.  ","")
            description = []

            item['price'] = prc
            item['title'] = title
            item['description'] = description
            item['url'] = response.url

            return item

Answer 1:


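Yes, you can do this by adding meta values to the request:

    meta={'dont_redirect': True}

You can also make sure the 302 response itself reaches your callback:

    meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}

Here dont_redirect stops the redirect from being followed for that request, and handle_httpstatus_list lets the 302 response through to the callback instead of it being filtered out as a non-2xx response.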

Example:

    yield Request('some url',
                  meta={
                      'dont_redirect': True,
                      'handle_httpstatus_list': [302]
                  },
                  callback=self.some_call_back)



Answer 2:


By default, Scrapy uses RedirectMiddleware to handle redirection. You can set REDIRECT_ENABLED to False to disable redirection.

See documentation.
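
As a rough sketch, assuming a standard project layout with a settings.py file, the setting could look like this. The HTTPERROR_ALLOWED_CODES line is not part of the answer above; it is included on the assumption that you want the 302 responses to reach your spider, since non-2xx responses are otherwise filtered out by default.

```python
# settings.py (project-wide)

# Disable RedirectMiddleware so 301/302 responses are not followed.
REDIRECT_ENABLED = False

# Assumption: let 302 responses through to spider callbacks instead of
# having them dropped as non-2xx responses.
HTTPERROR_ALLOWED_CODES = [302]
```

Unlike the per-request meta approach, this applies to every request the project makes.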




Answer 3:


After looking at the documentation and looking through the relevant source, I was able to figure it out. If you look in the source for start_requests, you'll see that it calls make_requests_from_url for all URLs.

Instead of modifying start_requests, I modified make_requests_from_url:

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True, meta={
            'dont_redirect': True,
            'handle_httpstatus_list': [301, 302]
        })

And added this as part of my spider, right above parse().




Answer 4:


As explained in the Scrapy docs, use the request meta:

    request = scrapy.Request(link.url, callback=self.parse2)
    request.meta['dont_redirect'] = True
    yield request


Source: https://stackoverflow.com/questions/15476587/scrapy-how-to-stop-redirect-302
