Scrapy 404 error: HTTP status code is not handled or not allowed


I'm trying to scrape the site coursetalk using Scrapy. I'm starting with the spider template and getting a 404 error:

2017-12-29 23:34:30 [scrapy] DEB


                      
2 Answers

再見小時候, 2020-12-19 19:06

I ran into this problem with Scrapy and solved it by changing USER_AGENT in settings.py:

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
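
One way to check whether the server's answer really depends on the User-Agent is to compare status codes outside Scrapy. The following is a minimal sketch using only the standard library; the helper names build_request and fetch_status are hypothetical, and whether the site still behaves this way is an assumption, not something verified here.

```python
import urllib.error
import urllib.request

# The browser-like value from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a request, optionally with an explicit User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Return the HTTP status code, including error codes such as 404."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still available on the error.
        return err.code


if __name__ == "__main__":
    url = "https://www.coursetalk.com/subjects/data-science/courses/"
    print("default UA:", fetch_status(build_request(url)))
    print("browser UA:", fetch_status(build_request(url, BROWSER_UA)))
```

If the two status codes differ, the User-Agent is likely what the server is reacting to; if both are 404, the setting described in the next answer is the more relevant fix.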
                                                                        
                                                        
            
            
              
                
青春惊慌失措, 2020-12-19 19:10

It looks like this website is unusual: the response status code is 404, but the body can still be fetched normally.

In Scrapy, HttpErrorMiddleware is enabled by default. It filters out unsuccessful HTTP responses (those outside the 200-299 range) so that spiders don't have to deal with them. For cases like this one, Scrapy provides the HTTPERROR_ALLOWED_CODES setting, which lets a spider handle responses even when they carry the listed error codes.

Adding HTTPERROR_ALLOWED_CODES = [404] to the project's settings.py works around this issue.

import logging

import scrapy


class ListaDeCursosSpider(scrapy.Spider):
    name = "lista_de_cursos"
    allowed_domains = ['www.coursetalk.com']
    start_urls = ['https://www.coursetalk.com/subjects/data-science/courses/']

    def parse(self, response):
        # With 404 allowed in settings.py, the response reaches this callback.
        logging.info("response.status:%s", response.status)
        # Extract the logo URL to confirm the body was fetched and parsed.
        logourl = response.selector.css('div.main-nav__logo img').xpath('@src').extract()
        logging.info("response.logourl:%s", logourl)
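
To make the filtering rule concrete, here is a simplified model of the decision described above. It is a sketch, not Scrapy's actual source: the function name is_response_allowed and its parameters are hypothetical. The names it mirrors are real Scrapy features, though: besides the project-wide HTTPERROR_ALLOWED_CODES setting, a spider can define a handle_httpstatus_list attribute, and a single request can carry handle_httpstatus_list or handle_httpstatus_all in its meta.

```python
def is_response_allowed(status, setting_allowed_codes=(),
                        spider_allowed_codes=None, meta=None):
    """Return True if a response with this status would reach the spider.

    Simplified model of the status filter (an assumption-laden sketch):
      setting_allowed_codes  stands in for HTTPERROR_ALLOWED_CODES
      spider_allowed_codes   stands in for the spider's handle_httpstatus_list
      meta                   stands in for Request.meta
    """
    meta = meta or {}
    # Successful responses always pass.
    if 200 <= status < 300:
        return True
    # A request can opt in to every status code.
    if meta.get("handle_httpstatus_all", False):
        return True
    # The most specific list wins: request meta, then spider, then settings.
    if "handle_httpstatus_list" in meta:
        allowed = meta["handle_httpstatus_list"]
    elif spider_allowed_codes is not None:
        allowed = spider_allowed_codes
    else:
        allowed = setting_allowed_codes
    return status in allowed


if __name__ == "__main__":
    print(is_response_allowed(404))                                # filtered
    print(is_response_allowed(404, setting_allowed_codes=[404]))   # passed on
```

In this model a 404 is dropped by default, which corresponds to the "HTTP status code is not handled or not allowed" message, and is passed to the callback once 404 appears in any of the three lists.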

                                                                        
                                                        
            
            
              
                