Question
I'd like to override Scrapy's default RFPDupeFilter class as follows:
from scrapy.dupefilters import RFPDupeFilter

class URLDupefilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        if not request.url.endswith('.xml'):
            return request.url
The rationale is that I would like to make the requests.seen file human-readable by using the scraped URLs (which are sufficiently unique) as fingerprints rather than hashes. However, I would like to omit URLs ending in .xml (which correspond to sitemap pages).
With this implementation, the request_fingerprint method returns None if the request's URL ends with .xml. Is this a valid implementation of a dupefilter?
Answer 1:
If you look at the request_seen() method of the RFPDupeFilter class, you can see how Scrapy compares fingerprints:
def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
The check fp in self.fingerprints would, in your case, resolve to None in {None}, since your fingerprint is None and self.fingerprints is a set. This is valid Python and evaluates correctly.
So yes, you can return None.
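To illustrate why this works, here is a short stand-alone demonstration (plain Python, no Scrapy needed): None is hashable, so set membership tests and set.add() handle it like any other value.

```python
# None is hashable, so it can be stored in a set and tested for membership,
# which is exactly what request_seen() does with the fingerprint.
fingerprints = set()

fp = None  # the fingerprint returned for a .xml URL
print(fp in fingerprints)  # False: None has not been added yet
fingerprints.add(fp)
print(fp in fingerprints)  # True: None is now a member of the set
```

Note the first membership test is False, which is why the very first request with a None fingerprint is not treated as a duplicate.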
Edit: However, this will let the first .xml request through, since the fingerprints set will not contain None yet at that point. Ideally, you also want to override the request_seen method in your dupefilter to simply return False if the fingerprint is None.
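Putting both pieces together, the combined logic can be sketched as a plain class so it runs without Scrapy installed. In a real project you would subclass scrapy.dupefilters.RFPDupeFilter as in the question; the __init__ here only mimics the relevant attributes (fingerprints, file) for illustration.

```python
import os


class URLDupefilter:
    """Sketch of a URL-based dupefilter; in a real project this would
    subclass scrapy.dupefilters.RFPDupeFilter rather than reimplement it."""

    def __init__(self, path=None):
        # Mirrors the attributes the real base class sets up.
        self.fingerprints = set()
        self.file = None
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')

    def request_fingerprint(self, request):
        # Human-readable fingerprint: the URL itself.
        # Sitemap pages (.xml) get no fingerprint at all.
        if not request.url.endswith('.xml'):
            return request.url

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp is None:
            # Never treat sitemap requests as duplicates.
            return False
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False
```

With this version, .xml requests are never filtered, while every other URL is filtered from its second occurrence onward.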
Source: https://stackoverflow.com/questions/44370949/is-it-ok-for-scrapys-request-fingerprint-method-to-return-none