How to avoid re-downloading media to S3 in Scrapy?

问题

I previously asked a similar question (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definite answer I'll ask it again.

I've downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recent" is or how to set this parameter.

Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it would appear that this is obtained from the FILES_EXPIRES setting, for which the default is 90 days:

class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading
    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.
    `new` files are those that pipeline never processed and needs to be
        downloaded from supplier site the first time.
    `uptodate` files are the ones that the pipeline processed and are still
        valid files.
    `expired` files are those that pipeline already processed but the last
        modification was made long time ago, so a reprocessing is recommended to
        refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download

Do I understand this correctly? Also, I do not see a similar Boolean statement with age_days in the S3FilesStore class; is the checking of age also implemented for files on S3? (I was also unable to find any tests testing this age-checking feature for S3).

回答1:

FILES_EXPIRES is indeed the setting to tell the FilesPipeline how "old" can a file be before downloading it (again).

The key section of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call, and for your question, it especially looks for the "last_modified" info. If last modified is older than "expires days", then the download is triggered.

You can check how the S3store gets the "last modified" information. It depends if botocore is available or not.

回答2:

One line answer to this would be - class FilesPipeline(MediaPipeline): is the only class responsible for managing, validating and downloading files in your local paths. class S3FilesStore(object): just gets the files from local paths and uploads them to S3.

class FSFilesStore is the one which manages all your local paths and FilesPipeline uses them to store your files at local.

Links:

https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L264 https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L397 https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L299

来源：https://stackoverflow.com/questions/44824013/how-to-avoid-re-downloading-media-to-s3-in-scrapy

标签

python

amazon-s3

scrapy