Setting a sticky cookie in scrapy


Question


The website I am scraping uses JavaScript to set a cookie, and the backend checks for it to make sure JS is enabled. Extracting the cookie from the HTML is simple enough, but then setting it in scrapy seems to be a problem. My code is:

import re

from scrapy.http import Request
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )), callback='parse_page'),)

    def init_request(self):
        # Fetch the page whose inline JavaScript sets the cookie.
        return Request(url=self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        # Pull the cookie name and value out of the setCookie(...) call.
        match = re.search(r'setCookie\(\'(.+?)\',\s*?\'(.+?)\',', response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        # Verify the cookie works by requesting a page that requires it.
        return Request(url=self.test_page, callback=self.check_test_page,
                       cookies={cookie: value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        pass  # scraping...

I can see that the content is available in check_test_page; the cookie works perfectly. But it never even gets to parse_page, since without the right cookie CrawlSpider doesn't see any links. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?
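One way to keep a cookie for the whole session without touching every callback is a small downloader middleware. This is a minimal sketch, not from the original question: StickyCookieMiddleware and the cookie name/value are made-up placeholders, and it assumes the request's cookies attribute is a dict (scrapy also accepts a list of dicts there).

class StickyCookieMiddleware(object):
    # Hypothetical middleware: injects a fixed cookie into every
    # outgoing request unless that request already sets it.
    def process_request(self, request, spider):
        request.cookies.setdefault('js_check', 'expected-value')  # placeholders

It would be registered in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.StickyCookieMiddleware': 700}.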

A less desirable alternative would be to set the cookie (the value never seems to change) through scrapy's configuration files somehow. Is that possible?
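If the value really never changes, one configuration-only sketch is to send it as a raw Cookie header via the DEFAULT_REQUEST_HEADERS setting; disabling the cookie middleware keeps scrapy from managing cookies on top of the fixed header. The cookie name/value below are placeholders, not from the original question:

# settings.py -- a sketch, assuming the cookie value is truly static.
COOKIES_ENABLED = False  # stop CookiesMiddleware from managing cookies itself

DEFAULT_REQUEST_HEADERS = {
    'Cookie': 'js_check=expected-value',  # placeholder name/value
}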


Answer 1:


I haven't used InitSpider before.

Looking at the code in scrapy.contrib.spiders.init.InitSpider, I see:

def initialized(self, response=None):
    """This method must be set as the callback of your last initialization
    request. See self.init_request() docstring for more info.
    """
    self._init_complete = True
    reqs = self._postinit_reqs[:]
    del self._postinit_reqs
    return reqs

def init_request(self):
    """This function should return one initialization request, with the
    self.initialized method as callback. When the self.initialized method
    is called this spider is considered initialized. If you need to perform
    several requests for initializing your spider, you can do so by using
    different callbacks. The only requirement is that the final callback
    (of the last initialization request) must be self.initialized. 

    The default implementation calls self.initialized immediately, and
    means that no initialization is needed. This method should be
    overridden only when you need to perform requests to initialize your
    spider
    """
    return self.initialized()

You wrote:

I can see that the content is available in check_test_page; the cookie works perfectly. But it never even gets to parse_page, since without the right cookie CrawlSpider doesn't see any links.

I think parse_page is never called because check_test_page does not return the result of self.initialized(): initialized() returns the queued post-initialization requests, and scrapy only schedules requests that a callback actually returns.

I think this should work:

def check_test_page(self, response):
    if 'Welcome' in response.body:
        return self.initialized()



Answer 2:


It turned out that InitSpider is a BaseSpider, not a CrawlSpider, so its rules attribute is never applied. So it looks like 1) there is no way to use CrawlSpider in this situation, and 2) there is no way to set a sticky cookie.
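If one did want to try a plain CrawlSpider anyway, here is a sketch: override start_requests so the cookie is fetched before crawling begins, then hand the start URLs to CrawlSpider's own parse with cookies= set once. It assumes the default CookiesMiddleware is enabled, which re-sends a cookie for the rest of the session once it is set; the spider name and URLs below are illustrative, not from the original question.

import re

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class CookieCrawlSpider(CrawlSpider):
    name = 'test'                                 # illustrative
    init_url = 'http://example.com/'              # page whose JS sets the cookie
    start_urls = ['http://example.com/products/']
    rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html',)),
                  callback='parse_page'),)

    def start_requests(self):
        # Fetch the JS page first, before any rule-driven crawling.
        return [Request(self.init_url, callback=self.parse_js)]

    def parse_js(self, response):
        match = re.search(r'setCookie\(\'(.+?)\',\s*?\'(.+?)\',',
                          response.body, re.M)
        if not match:
            raise Exception('Did not find the cookie')
        # cookies= stores the cookie in the session cookiejar, so the
        # default CookiesMiddleware re-sends it on every later request.
        # callback=self.parse hands control back to CrawlSpider's
        # rule-based link extraction.
        return [Request(url, callback=self.parse,
                        cookies={match.group(1): match.group(2)})
                for url in self.start_urls]

    def parse_page(self, response):
        pass  # scraping...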



Source: https://stackoverflow.com/questions/11949667/setting-sticky-cookie-in-scrapy
