how to filter duplicate requests based on url in scrapy

挽巷 2020-11-29 18:40

I am writing a crawler for a website using scrapy with CrawlSpider.

Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on url.
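
For reference, that built-in filter can also be swapped for a custom one. A minimal sketch, assuming a reasonably recent Scrapy (the class name and module path are illustrative), of a dupefilter keyed purely on the URL:

    from scrapy.dupefilters import RFPDupeFilter

    class URLDupeFilter(RFPDupeFilter):
        # Fingerprint requests by their plain URL instead of the default
        # fingerprint built from method, URL and body.
        def request_fingerprint(self, request):
            return request.url

    # settings.py
    # DUPEFILTER_CLASS = 'myproject.dupefilters.URLDupeFilter'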

5 Answers
  •  猫巷女王i
    2020-11-29 18:54

    https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py

    This file might help you. It builds a database of unique deltafetch keys from the urls that a user passes in via scrapy.Request(meta={'deltafetch_key': unique_url_key}). This lets you skip requests you have already visited in the past.

    A sample MongoDB implementation using deltafetch.py:

        from datetime import datetime

        from pymongo.errors import DuplicateKeyError
        from scrapy import Request, log
        from scrapy.item import BaseItem

        # Inside the middleware's process_spider_output(), iterating over the
        # requests/items generated for a response:
        for r in result:
            if isinstance(r, Request):
                # Scope the deltafetch key to this spider
                key = self._get_key(r) + spider.name

                # Drop requests whose key is already stored in MongoDB
                if self.db['your_collection_to_store_deltafetch_key'].find_one({"_id": key}):
                    spider.log("Ignoring already visited: %s" % r, level=log.INFO)
                    continue
            elif isinstance(r, BaseItem):
                # An item was scraped, so record the key of the request that produced it
                key = self._get_key(response.request) + spider.name
                try:
                    self.db['your_collection_to_store_deltafetch_key'].insert(
                        {"_id": key, "time": datetime.now()})
                except DuplicateKeyError:
                    spider.log("Key already stored: %s" % key, level=log.ERROR)
            yield r
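    To use it, the middleware also has to be enabled in settings.py. A minimal sketch, assuming the scrapylib version of the DeltaFetch middleware (the priority value is arbitrary):

        # settings.py
        SPIDER_MIDDLEWARES = {
            'scrapylib.deltafetch.DeltaFetch': 100,
        }
        DELTAFETCH_ENABLED = True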

    e.g. for id = 345: scrapy.Request(url, meta={'deltafetch_key': 345}, callback=self.parse)
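
    Put together, a minimal hypothetical spider that passes a per-record id as the deltafetch key could look like this (spider name, URLs and selectors are illustrative):

        import scrapy

        class ProductSpider(scrapy.Spider):
            name = 'products'
            start_urls = ['https://example.com/products']

            def parse(self, response):
                for href in response.css('a.product::attr(href)').extract():
                    record_id = href.rstrip('/').split('/')[-1]  # e.g. '345'
                    yield scrapy.Request(
                        response.urljoin(href),
                        meta={'deltafetch_key': record_id},
                        callback=self.parse_product,
                    )

            def parse_product(self, response):
                yield {'url': response.url,
                       'title': response.css('h1::text').extract_first()}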
