Python Requests: requests.exceptions.TooManyRedirects: Exceeded 30 redirects


I was trying to crawl this page using the python-requests library:

import requests
from lxml import etree, html

url = 'http://www.amazon.in/b/ref=sa_menu_mobile_


                      
3 Answers

我在风中等你, 2020-12-03 14:46

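You can raise the redirect limit by setting max_redirects on a Session explicitly. Note that this only helps when the redirect chain is long but finite; it cannot fix a redirect loop. For example: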

import requests

session = requests.Session()
session.max_redirects = 60  # the default limit is 30
session.get('http://www.amazon.com')
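
If the limit is still exceeded, requests raises requests.exceptions.TooManyRedirects. Below is a minimal sketch of catching it, assuming you want to treat a runaway redirect chain as a failed fetch. make_session and fetch are hypothetical helper names, not part of requests, and the 10-second timeout is an arbitrary choice:

```python
import requests


def make_session(max_redirects=60):
    """Return a Session with a custom redirect limit (requests defaults to 30)."""
    session = requests.Session()
    session.max_redirects = max_redirects
    return session


def fetch(session, url):
    """Return the final response, or None if the redirect limit is exceeded."""
    try:
        return session.get(url, timeout=10)
    except requests.exceptions.TooManyRedirects as exc:
        # Raised when the chain is longer than session.max_redirects,
        # which is also what happens when the server redirects in a loop.
        print(f"Gave up on {url}: {exc}")
        return None
```

Calling fetch(make_session(), 'http://www.amazon.com') performs a real network request, so the outcome depends on how the site responds at that moment.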

                                                                        
                                                        
            
            
              
                
说谎, 2020-12-03 14:59

You need to copy the cookie value into your request headers. It works on my end.
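
A minimal sketch of what that could look like, assuming you copy the Cookie request header from your browser's developer tools. The cookie string and user agent below are placeholders, and build_request is a hypothetical helper. The request is only prepared here, so the headers can be inspected before anything is sent:

```python
import requests

# Placeholder values: copy the real ones from your browser's developer tools.
COOKIE = "session-id=REPLACE-ME; session-token=REPLACE-ME"
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, cookie=COOKIE, user_agent=USER_AGENT):
    """Prepare a GET request that carries a hand-copied Cookie header."""
    headers = {"Cookie": cookie, "User-Agent": user_agent}
    return requests.Request("GET", url, headers=headers).prepare()


prepared = build_request("http://www.amazon.in/")
print(prepared.headers["Cookie"])

# To actually send it (network access required):
# response = requests.Session().send(prepared)
```

Copied cookies expire, so this is more of a quick check than a durable fix.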
                                                                        
                                                        
            
            
              
                
悲哀的现实, 2020-12-03 15:00

Amazon is redirecting your request to http://www.amazon.in/b?ie=UTF8&node=976419031, which in turn redirects to http://www.amazon.in/electronics/b?ie=UTF8&node=976419031, after which you have entered a loop:

>>> loc = url
>>> seen = set()
>>> while True:
...     r = requests.get(loc, allow_redirects=False)
...     loc = r.headers['location']
...     if loc in seen: break
...     seen.add(loc)
...     print(loc)
... 
http://www.amazon.in/b?ie=UTF8&node=976419031
http://www.amazon.in/electronics/b?ie=UTF8&node=976419031
>>> loc
http://www.amazon.in/b?ie=UTF8&node=976419031


So your original URL A redirects to a new URL B, which redirects to C, which redirects back to B, and so on.
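
The same trace can be sketched as a reusable Python 3 function. This is an illustration rather than the answerer's code: trace_redirects and requests_location are hypothetical names, the hop limit and timeout are arbitrary choices, and the lookup is passed in as a callable so the loop detection can be tried without network access:

```python
from urllib.parse import urljoin

import requests


def requests_location(url):
    """Return the Location header of url, or None if it is not a redirect."""
    response = requests.get(url, allow_redirects=False, timeout=10)
    return response.headers.get("location")


def trace_redirects(start, get_location=requests_location, max_hops=30):
    """Follow redirects by hand; return (chain, loop_url).

    chain lists every URL visited in order, starting with start; loop_url
    is the first URL that came up a second time, or None if none did.
    """
    chain = [start]
    seen = {start}
    current = start
    for _ in range(max_hops):
        location = get_location(current)
        if location is None:
            return chain, None  # final page reached, no loop
        current = urljoin(current, location)  # Location may be relative
        if current in seen:
            return chain, current  # already visited: redirect loop
        seen.add(current)
        chain.append(current)
    return chain, None  # hop limit hit without finding a repeat


# Offline demonstration with a fake redirect table: A -> B -> C -> B.
fake = {"http://x.test/a": "/b", "http://x.test/b": "/c", "http://x.test/c": "/b"}
chain, loop_url = trace_redirects("http://x.test/a", fake.get)
print(chain, loop_url)
```

Calling trace_redirects(url) with the default lookup issues real requests, so what it reports depends on how the site responds at that moment.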

Apparently Amazon decides this based on the User-Agent header, and it sets a cookie that subsequent requests are expected to send back. The following works:

>>> s = requests.Session()
>>> s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> r = s.get(url)
>>> r
<Response [200]>


This creates a session (for ease of reuse and for cookie persistence) and sets its User-Agent to a copy of the Chrome user agent string. The request then succeeds with a 200 response.
                                                                        
                                                        
            
            
              
                