Question
Is it possible to remove requests from scrapy's scheduler queue? I have a working routine that limits crawling of a given domain to a set amount of time. It works in the sense that it will not yield any more links once the time limit is hit, but since the queue may already contain thousands of requests for that domain, I'd like to remove them from the scheduler queue once the time limit is hit.
Answer 1:
Okay so I ended up following the suggestion from @rickgh12hs and wrote my own Downloader Middleware:
from scrapy.exceptions import IgnoreRequest
import tldextract

class clearQueueDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Extract the registered domain (e.g. "example.com") from the request URL
        domain_obj = tldextract.extract(request.url)
        just_domain = domain_obj.registered_domain
        if just_domain in spider.blocked:
            print("Blocked domain: %s (url: %s)" % (just_domain, request.url))
            raise IgnoreRequest("URL blocked: %s" % request.url)
spider.blocked is a class-level list variable that contains the blocked domains, preventing any further downloads from them. Seems to work great, kudos to @rickgh12hs!
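To illustrate the blocking check itself outside of Scrapy, here is a minimal standalone sketch. It assumes urllib.parse instead of tldextract (so the "domain" it compares is the full netloc, e.g. "sub.example.com", not the registered domain), and the class and method names are hypothetical, not from the original middleware:

```python
from urllib.parse import urlparse

class BlockedDomainCheck:
    """Standalone sketch of the middleware's per-request check.

    NOTE: this uses urllib.parse rather than tldextract, so the
    comparison is on the full netloc, not the registered domain.
    """

    def __init__(self, blocked):
        # Set of domains whose queued requests should be dropped
        self.blocked = set(blocked)

    def should_drop(self, url):
        # Mirrors the middleware: drop the request if its domain is blocked
        return urlparse(url).netloc in self.blocked

check = BlockedDomainCheck(blocked={"example.com"})
print(check.should_drop("http://example.com/page"))  # True: domain is blocked
print(check.should_drop("http://other.org/page"))    # False: domain not blocked
```

In the real middleware, a `True` result corresponds to raising IgnoreRequest, which Scrapy uses to silently discard the request before it is downloaded. Also remember that a downloader middleware only takes effect once it is enabled in DOWNLOADER_MIDDLEWARES in the project's settings.py.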
Source: https://stackoverflow.com/questions/30438650/is-it-possible-to-remove-requests-from-scrapys-scheduler-queue