I'm trying to scrape the site coursetalk using Scrapy. I'm starting with the spider template and getting a 404 error:
2017-12-29 23:34:30 [scrapy] DEB
I ran into the same problem with Scrapy and solved it by changing USER_AGENT in settings.py:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
This website behaves oddly: it returns a 404 status code, yet the response body can still be fetched normally.
Scrapy enables HttpErrorMiddleware by default. It filters out unsuccessful HTTP responses so that spiders don't have to deal with them. For cases like this one, Scrapy provides the HTTPERROR_ALLOWED_CODES setting, which lets a spider process responses even when they carry an error status code.
Adding HTTPERROR_ALLOWED_CODES = [404] to the project's settings.py works around the issue, as this spider shows:
import logging

import scrapy


class ListaDeCursosSpider(scrapy.Spider):
    name = "lista_de_cursos"
    allowed_domains = ['www.coursetalk.com']
    start_urls = ['https://www.coursetalk.com/subjects/data-science/courses/']

    def parse(self, response):
        logging.info("response.status:%s" % response.status)
        logourl = response.selector.css('div.main-nav__logo img').xpath('@src').extract()
        logging.info('response.logourl:%s' % logourl)
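To make the filtering behaviour described above concrete, here is a minimal sketch in plain Python. It collects the two settings mentioned in this thread and models, in a simplified way, how a status code is let through or dropped. The helper `reaches_spider` is a hypothetical name made up for illustration; it is not part of Scrapy and leaves out the other ways Scrapy lets a spider opt in to error responses.

```python
# The two settings discussed in this thread, as they would appear in settings.py.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
)
HTTPERROR_ALLOWED_CODES = [404]


def reaches_spider(status, allowed_codes=HTTPERROR_ALLOWED_CODES):
    """Hypothetical helper (not a Scrapy API): simplified model of the filter.

    Responses in the 2xx range always pass. Any other status passes only
    if it is listed in the allowed codes.
    """
    if 200 <= status < 300:
        return True
    return status in allowed_codes


if __name__ == "__main__":
    print(reaches_spider(200))  # True: successful responses always pass
    print(reaches_spider(404))  # True: 404 is explicitly allowed
    print(reaches_spider(500))  # False: still filtered out
```

Keeping the allowed list as narrow as possible (here only 404) means other error responses, such as 500, are still dropped before they reach parse.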