Is it possible to remove requests from scrapys scheduler queue?

删除回忆录丶 提交于 2019-12-24 00:47:44

问题


Is it possible to remove requests from scrapy's scheduler queue? I have a working routine that limits crawling to a certain domain for a set amount of time. It's working in the sense that it will not yield anymore links once the time limit was hit but as the queue can already contain thousands of requests for the domain I'd like to remove them from the scheduler queue once the time limit is hit.


回答1:


Okay so I ended up following the suggestion from @rickgh12hs and wrote my own Downloader Middleware:

from scrapy.exceptions import IgnoreRequest
import tldextract

class clearQueueDownloaderMiddleware(object):
    def process_request(self, request, spider):
        domain_obj = tldextract.extract(request.url)
        just_domain = domain_obj.registered_domain
        if(just_domain in spider.blocked):
            print "Blocked domain: %s (url: %s)" % (just_domain, request.url)
            raise IgnoreRequest("URL blocked: %s" % request.url)

spider.blocked is a class list variable that contains blocked domains preventing any further downloads from the blocked domains. Seem to work great, cudos to @rickgh12hs!



来源:https://stackoverflow.com/questions/30438650/is-it-possible-to-remove-requests-from-scrapys-scheduler-queue

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!