Avoid Duplicate URL Crawling

灰色年华 · asked 2020-12-13 15:47

I coded a simple crawler. In the settings.py file, following the Scrapy documentation, I set

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
3 Answers
  •  Happy的楠姐
    2020-12-13 16:30

    According to the documentation, DUPEFILTER_CLASS is already set to scrapy.dupefilter.RFPDupeFilter by default.

    RFPDupeFilter doesn't help if you stop and restart the crawler: it only deduplicates requests within a single crawl, so it keeps you from scraping duplicate URLs during that run.

    It looks like you need to create your own custom filter based on RFPDupeFilter, as was done in how to filter duplicate requests based on url in scrapy. If you want the filter to work across scrapy crawl sessions, you should keep the list of crawled URLs somewhere persistent, such as a database or a CSV file.
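
    Here is a minimal sketch of such a filter, assuming the seen URLs are kept in a plain text file. The file name seen_urls.txt, the class name PersistentURLDupeFilter, and the module path myproject.dupefilters are placeholders I chose for illustration; on older Scrapy versions the base class lives in scrapy.dupefilter rather than scrapy.dupefilters.

        import os

        from scrapy.dupefilters import RFPDupeFilter  # scrapy.dupefilter on older versions


        class PersistentURLDupeFilter(RFPDupeFilter):
            """Skips requests whose URL was seen in this run or a previous one."""

            SEEN_FILE = "seen_urls.txt"  # placeholder path, relative to the working dir

            def __init__(self, path=None, debug=False, **kwargs):
                super().__init__(path, debug, **kwargs)
                self.seen_urls = set()
                # Load URLs recorded by earlier crawl sessions, if any.
                if os.path.exists(self.SEEN_FILE):
                    with open(self.SEEN_FILE) as f:
                        self.seen_urls = {line.strip() for line in f if line.strip()}

            def request_seen(self, request):
                # Drop the request if its URL was crawled in a previous session;
                # otherwise fall back to the normal in-memory fingerprint check.
                if request.url in self.seen_urls:
                    return True
                self.seen_urls.add(request.url)
                return super().request_seen(request)

            def close(self, reason):
                # Persist the URL list so the next `scrapy crawl` run can skip them.
                with open(self.SEEN_FILE, "w") as f:
                    f.write("\n".join(sorted(self.seen_urls)))
                return super().close(reason)

    Then point settings.py at it, for example DUPEFILTER_CLASS = 'myproject.dupefilters.PersistentURLDupeFilter'. A database table would work the same way, with only the load and save steps changing.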

    Hope that helps.
