Scrapy 404 error: HTTP status code is not handled or not allowed

再見小時候 2020-12-19 18:51

I'm trying to scrape the site coursetalk using Scrapy. I'm starting with the spider template and getting a 404 error:

2017-12-29 23:34:30 [scrapy] DEB         


        
2 answers
  • 2020-12-19 19:06

    I ran into this problem with Scrapy and solved it by changing USER_AGENT in settings.py:

    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
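
To check whether the User-Agent is really what triggers the 404, one option is to request the page outside Scrapy with two different User-Agent strings and compare the status codes. The following is a minimal sketch using only the standard library. The helper names `build_request` and `fetch_status` are hypothetical, and the "Scrapy/1.4" string in the commented example is only a placeholder for a non-browser User-Agent.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent taken from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
)


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code, including error codes such as 404."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still useful.
        return err.code


# Example (performs real network requests, so it is left commented out):
# url = "https://www.coursetalk.com/subjects/data-science/courses/"
# print(fetch_status(url, "Scrapy/1.4"), fetch_status(url, BROWSER_UA))
```

If both calls return the same status, the User-Agent is probably not the cause and the status code itself has to be allowed in Scrapy.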

  • 2020-12-19 19:10

    This website is unusual: it responds with status code 404, yet the body can still be fetched normally.

    Scrapy enables HttpErrorMiddleware by default. It filters out unsuccessful HTTP responses so that spiders don't have to deal with them. For cases like this one, Scrapy provides the HTTPERROR_ALLOWED_CODES setting, which lets the spider process responses even when they carry an error status code.

    Adding HTTPERROR_ALLOWED_CODES = [404] to the project's settings.py works around the issue:

    import logging

    import scrapy


    class ListaDeCursosSpider(scrapy.Spider):
        name = "lista_de_cursos"
        allowed_domains = ['www.coursetalk.com']
        start_urls = ['https://www.coursetalk.com/subjects/data-science/courses/']

        def parse(self, response):
            # With 404 allowed, this callback runs even though the status is an error code.
            logging.info("response.status:%s", response.status)
            # Extract the logo's src attribute to confirm the body was parsed.
            logourl = response.css('div.main-nav__logo img').xpath('@src').extract()
            logging.info('response.logourl:%s', logourl)
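
The filtering rule described in this answer can be illustrated with a small model. This is a simplified sketch of the decision, not Scrapy's actual source: the function name `is_passed_to_spider` is hypothetical, `allowed_codes` stands in for HTTPERROR_ALLOWED_CODES, and `allow_all` stands in for the HTTPERROR_ALLOW_ALL setting.

```python
def is_passed_to_spider(status, allowed_codes=(), allow_all=False):
    """Simplified model: does a response with this status reach the spider callback?"""
    if 200 <= status < 300:
        # Successful responses are always passed through.
        return True
    if allow_all:
        # Models HTTPERROR_ALLOW_ALL = True.
        return True
    # Models HTTPERROR_ALLOWED_CODES: only listed error codes get through.
    return status in allowed_codes


print(is_passed_to_spider(404))                       # False
print(is_passed_to_spider(404, allowed_codes=[404]))  # True
```

Scrapy also documents a per-spider `handle_httpstatus_list` attribute, which limits the exception to a single spider instead of the whole project.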
    