Downloading pictures with scrapy


I'm just starting with Scrapy, and I've hit my first real problem: downloading pictures. This is my spider.

from scrapy.contrib.spiders import CrawlSpider, R


                      
2 answers

野性不改, 2020-12-16 05:22

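I think the image URL you scraped is relative. To construct the absolute URL, use urlparse.urljoin (on Python 3 the same function lives in urllib.parse):

import urlparse  # Python 2; on Python 3: from urllib import parse as urlparse

def parse(self, response):
    ...
    # extract() returns a list of matches; take the first one
    image_relative_url = hxs.select("...").extract()[0]
    # Resolve the relative path against the URL of the page it came from
    image_absolute_url = urlparse.urljoin(response.url, image_relative_url.strip())
    # image_urls must be a list, even for a single image
    item['image_urls'] = [image_absolute_url]
    ...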
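To see what urljoin does with different kinds of src values, here is a small standalone sketch for Python 3. The page URL, the image paths, and the helper name to_absolute are made up for illustration; they are not taken from the question.

```python
from urllib.parse import urljoin

# Hypothetical page URL, standing in for response.url
page_url = "http://www.domain.com/gallery/page1.html"


def to_absolute(base_url, src):
    """Strip stray whitespace and resolve src against base_url."""
    return urljoin(base_url, src.strip())


# A root-relative path replaces the whole path of the page URL
root_relative = to_absolute(page_url, " /images/cat.jpg ")

# A document-relative path is resolved against the page's directory
doc_relative = to_absolute(page_url, "thumbs/cat.jpg")

# An already absolute URL is returned unchanged
already_absolute = to_absolute(page_url, "http://cdn.example.com/cat.jpg")

print(root_relative)
print(doc_relative)
print(already_absolute)
```

Because an absolute URL passes through untouched, it should be safe to apply this to every scraped src without checking first whether it is relative.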




I haven't used ITEM_PIPELINES myself, but the docs say:


  In a Spider, you scrape an item and put the URLs of its images into a image_urls field.


So, item['image_urls'] should be a list of image URLs. But your code has:

item['image_urls'] = 'http://www.domain.com' + item['image_urls']


So I guess it iterates over your single URL string character by character, treating each character as a URL.
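The guess above is easy to check in plain Python. The sketch below is a simplified stand-in for any code that loops over the field, not Scrapy's actual pipeline code; the function name urls_seen_by_consumer and the image path are assumptions for illustration.

```python
def urls_seen_by_consumer(image_urls):
    """Mimic a consumer that treats every element of the field as one URL."""
    return [url for url in image_urls]


# What the question's code produces: a single string
as_string = 'http://www.domain.com' + '/images/cat.jpg'

# What the field is supposed to hold: a list of strings
as_list = [as_string]

chars = urls_seen_by_consumer(as_string)
urls = urls_seen_by_consumer(as_list)

print(chars[:4])  # the first few "URLs" are single characters
print(urls)       # exactly one real URL
```

Iterating a string yields its characters, so the string version hands the consumer one bogus "URL" per character, while the list version hands it the single intended URL.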
                                                                        
                                                        
            
            
              
                
甜味超标, 2020-12-16 05:39

I think you need to pass your image URL to the Item as a list, not as a bare string:

item['image_urls'] = [ 'http://www.domain.com' + item['image_urls'] ]
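If the scraped value may be either a single string or a list, a small helper can cover both suggestions in this thread at once. This is only a sketch for Python 3: normalize_image_urls is a hypothetical name, not a Scrapy function, and the URLs are placeholders.

```python
from urllib.parse import urljoin


def normalize_image_urls(base_url, value):
    """Return a list of absolute URLs from a string or a list of strings."""
    # Wrap a bare string so it is not iterated character by character
    if isinstance(value, str):
        value = [value]
    # Resolve each entry against the page URL; absolute URLs pass through
    return [urljoin(base_url, v.strip()) for v in value]


base = 'http://www.domain.com'  # hypothetical site root
from_string = normalize_image_urls(base, '/images/cat.jpg')
from_list = normalize_image_urls(base, ['/images/a.jpg', ' /images/b.jpg '])

print(from_string)
print(from_list)
```

In a spider, the result could then be assigned directly, for example item['image_urls'] = normalize_image_urls(response.url, scraped_value), where scraped_value stands for whatever the selector returned.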

                                                                        
                                                        
            
            
              
                