Scrapy and response status code: how to check against it?

Posted by 你离开我真会死。 on 2019-11-27 01:37:25

Question


I'm using Scrapy to crawl my sitemap and check for 404, 302 and 200 pages, but I can't seem to get the response code. This is my code so far:

# Import path in Scrapy 1.0 and later; older releases used scrapy.contrib.spiders
from scrapy.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'

    ## things we need for tothego ##
    sitemap_urls = []
    ok_log_file =       '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
    bad_log_file =      '/opt/Workspace/myapp/crawler/bad_homes'
    fourohfour =        '/opt/Workspace/myapp/crawler/404/404_homes'

    def __init__(self, **kwargs):
        SitemapSpider.__init__(self)

        if len(kwargs) > 1:
            if 'domain' in kwargs:
                self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]

            if 'country' in kwargs:
                self.ok_log_file += "_%s.txt" % kwargs['country']
                self.bad_log_file += "_%s.txt" % kwargs['country']
                self.fourohfour += "_%s.txt" % kwargs['country']

        else:
            print("USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain] \nWith [crawler_name]:\n- tothego_homes_spider\n- tothego_cars_spider\n- tothego_jobs_spider\n")
            exit(1)

    def parse(self, response):
        try:
            if response.status == 404:
                ## 404s are also tracked separately
                self.append(self.bad_log_file, response.url)
                self.append(self.fourohfour, response.url)

            elif response.status == 200:
                ## write to ok_log_file
                self.append(self.ok_log_file, response.url)
            else:
                self.append(self.bad_log_file, response.url)

        except Exception as e:
            self.log('[exception] : %s' % e)

    def append(self, path, line):
        # Append one line to the given log file; "with" closes it even on error.
        with open(path, 'a') as log_file:
            log_file.write(line + "\n")

According to Scrapy's docs, response.status is an integer corresponding to the status code of the response. So far, only the URLs with a 200 status are logged, while the 302s aren't written to the output file (though I can see the redirects in crawl.log). So what do I have to do to "trap" the 302 responses and save those URLs?


Answer 1:


http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

Assuming the default spider middlewares are enabled, responses with status codes outside the 200-300 range are filtered out by HttpErrorMiddleware before they reach your callback. You can tell the middleware that you want to handle 404s by setting the handle_httpstatus_list attribute on your spider.

class TothegoSitemapHomesSpider(SitemapSpider):
    handle_httpstatus_list = [404]
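
As a rough illustration of the filtering described above, here is a simplified stand-in for the decision, written in plain Python. It is not Scrapy's actual code: the function name reaches_callback is made up for this sketch, and the real middleware also looks at request meta and project settings.

```python
def reaches_callback(status, handle_httpstatus_list=()):
    """Simplified model of the HTTP-error filter (hypothetical helper).

    Successful (2xx) responses always pass; any other status passes
    only if the spider has listed it in handle_httpstatus_list.
    """
    if 200 <= status < 300:
        return True
    return status in handle_httpstatus_list


if __name__ == "__main__":
    # Without the attribute, a 404 never gets to parse().
    print(reaches_callback(404))
    # With handle_httpstatus_list = [404], it does.
    print(reaches_callback(404, [404]))
```

This is why the spider in the question only ever logs 200s: under the default settings, nothing else is handed to parse.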



Answer 2:


Just to make the answer complete:

  • Set handle_httpstatus_list = [302] on the spider;

  • On the request, set dont_redirect to True in meta.

For example: Request(URL, meta={'dont_redirect': True})
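
Once 302 responses are delivered to the callback, the status check in parse can sort them into their own bucket instead of lumping them in with other failures. A minimal sketch of that sorting in plain Python follows; classify_status and its labels are hypothetical names chosen for illustration, not part of Scrapy.

```python
def classify_status(status):
    """Map an HTTP status code to a log bucket (hypothetical labels)."""
    if status == 200:
        return "ok"
    if status == 404:
        return "not_found"
    if 300 <= status < 400:
        # Redirects such as 301 and 302 get their own bucket.
        return "redirect"
    return "bad"


if __name__ == "__main__":
    for code in (200, 302, 404, 500):
        print(code, classify_status(code))
```

In a spider callback this would be called as classify_status(response.status), with the URL then appended to whichever log file matches the returned label.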



Source: https://stackoverflow.com/questions/9698372/scrapy-and-response-status-code-how-to-check-against-it
