Is it OK for Scrapy's request_fingerprint method to return None?

Posted by 与世无争的帅哥 on 2019-12-25 16:59:29

Question


I'd like to override Scrapy's default RFPDupeFilter class as follows:

from scrapy.dupefilters import RFPDupeFilter

class URLDupefilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        if not request.url.endswith('.xml'):
            return request.url

The rationale is that I would like to make the requests.seen file human-readable by storing the scraped URLs (which are sufficiently unique) rather than a hash. However, I would like to omit URLs ending with .xml (which correspond to sitemap pages).

Written like this, the request_fingerprint method returns None if the request's URL ends with .xml. Is this a valid implementation of a dupefilter?


Answer 1:


If you look at the request_seen() method of the RFPDupeFilter class, you can see how Scrapy compares fingerprints:

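import os

def request_seen(self, request):
    # Method of RFPDupeFilter; self.fingerprints is a set.
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True  # fingerprint already recorded: treat as a duplicate
    self.fingerprints.add(fp)
    if self.file:
        # Append the new fingerprint to the open fingerprint file.
        self.file.write(fp + os.linesep)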

The key line is fp in self.fingerprints. In your case the fingerprint is None and self.fingerprints is a set, so the check becomes None in self.fingerprints. None is hashable, so this is valid Python and evaluates without error.
So yes, you can return None.
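
To make the membership check concrete, here is a minimal sketch in plain Python. It does not use Scrapy: the module-level `fingerprints` set and the simplified `request_seen(fp)` function are stand-ins I am assuming for illustration, mirroring only the set logic quoted above.

```python
# Stand-in for self.fingerprints in the dupefilter (assumption: a plain set).
fingerprints = set()


def request_seen(fp):
    """Simplified stand-in for the membership check quoted above."""
    if fp in fingerprints:
        return True
    fingerprints.add(fp)
    return False


first = request_seen(None)   # None is not in the set yet
second = request_seen(None)  # None was added by the first call
print(first, second)
```

The first call reports the None fingerprint as unseen and records it; the second call finds it in the set.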

Edit: However, this will let only the first .xml request through, because the fingerprints set does not contain None yet. After that, None is in the set, so every later .xml request is treated as a duplicate. Ideally, you should also override the request_seen method in your dupefilter so that it simply returns False when the fingerprint is None.
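
A rough sketch of that override might look as follows. To keep it runnable without Scrapy installed, it subclasses a hypothetical `MiniDupeFilter` stand-in that copies only the request_seen logic quoted above; in a real project you would presumably subclass `scrapy.dupefilters.RFPDupeFilter` instead and point the `DUPEFILTER_CLASS` setting at your class. The `SimpleNamespace` objects are also just stand-ins for requests with a `.url` attribute. Returning early for a None fingerprint also sidesteps the `fp + os.linesep` line, which would raise a TypeError for None whenever `self.file` is set.

```python
import io
import os
from types import SimpleNamespace


class MiniDupeFilter:
    """Hypothetical stand-in mirroring the quoted request_seen(); not Scrapy's class."""

    def __init__(self, file=None):
        self.fingerprints = set()
        self.file = file

    def request_fingerprint(self, request):
        raise NotImplementedError

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False


class URLDupeFilter(MiniDupeFilter):
    def request_fingerprint(self, request):
        # Use the URL itself as the fingerprint, except for sitemap pages.
        if not request.url.endswith('.xml'):
            return request.url
        return None

    def request_seen(self, request):
        # A None fingerprint means "never deduplicate this request".
        if self.request_fingerprint(request) is None:
            return False
        return super().request_seen(request)


# Usage with stand-in request objects (assumption: only .url is needed).
seen_file = io.StringIO()
dupefilter = URLDupeFilter(file=seen_file)
sitemap = SimpleNamespace(url='https://example.com/sitemap.xml')
page = SimpleNamespace(url='https://example.com/page')

results = [
    dupefilter.request_seen(sitemap),  # sitemap: never a duplicate
    dupefilter.request_seen(sitemap),  # still not a duplicate
    dupefilter.request_seen(page),     # first visit to the page
    dupefilter.request_seen(page),     # second visit: duplicate
]
print(results)
```

With this arrangement None never enters the fingerprints set or the fingerprint file, so only real URLs are recorded. The sketch computes the fingerprint twice for non-sitemap requests, which keeps the override short at the cost of one extra string check.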



Source: https://stackoverflow.com/questions/44370949/is-it-ok-for-scrapys-request-fingerprint-method-to-return-none
