Should I create pipeline to save files with scrapy?

旧时难觅i · 2020-12-13 08:12

I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in directories, much like they are stored on the site I'm scraping them from.

3 Answers
  •  鱼传尺愫 · 2020-12-13 08:34

    Yes and no[1]. If you fetch a PDF, the response is held in memory, but as long as the PDFs are not big enough to fill up your available memory, that is fine.

    You could save the pdf in the spider callback:

    # in the spider (assumes `from scrapy import Request` at the top of the module)
    def parse_listing(self, response):
        # ... extract pdf urls
        for url in pdf_urls:
            # request each PDF and hand the response to save_pdf
            yield Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        # derive a local file path from the URL and write the raw bytes to disk
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)
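
    The get_path() helper used above (and in the pipeline below) is never defined in the answer. A minimal sketch, assuming you just want the last URL segment as the filename under a local "pdfs" directory (both the directory name and the fallback filename are assumptions):

    import os
    from urllib.parse import urlparse

    def get_path(self, url):
        # use the last path segment of the URL as the filename
        name = os.path.basename(urlparse(url).path) or "download.pdf"
        # store everything under a local "pdfs" directory
        os.makedirs("pdfs", exist_ok=True)
        return os.path.join("pdfs", name)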
    

    If you choose to do it in a pipeline:

    # in the spider
    def parse_pdf(self, response):
        i = MyItem()
        i['body'] = response.body
        i['url'] = response.url
        # you can add more metadata to the item
        return i
    
    # in your pipeline
    def process_item(self, item, spider):
        path = self.get_path(item['url'])
        with open(path, "wb") as f:
            f.write(item['body'])
        # remove body and add path as reference
        del item['body']
        item['path'] = path
        # let the item be processed by other pipelines, e.g. a db store
        return item
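
    For this to run you also need the item definition and the pipeline registered in settings; a minimal sketch (MyItem, the PdfSavePipeline class and the myproject module path are assumptions, not names from the answer):

    # items.py
    import scrapy

    class MyItem(scrapy.Item):
        body = scrapy.Field()
        url = scrapy.Field()
        path = scrapy.Field()

    # settings.py -- register the pipeline so process_item() gets called
    ITEM_PIPELINES = {
        "myproject.pipelines.PdfSavePipeline": 300,
    }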
    

    [1] Another approach could be to store only the PDFs' URLs and use another process to fetch the documents without buffering them into memory (e.g. wget).
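
    A rough sketch of that approach, assuming the spider only yields items with a url field that you export as JSON Lines (the urls.jl / pdfs names and the spider name are assumptions):

    # first export the URLs, e.g.:  scrapy crawl myspider -o urls.jl
    import json
    import subprocess

    with open("urls.jl") as f:
        for line in f:
            url = json.loads(line)["url"]
            # fetch each document with wget, outside the crawler process
            subprocess.run(["wget", "-P", "pdfs", url], check=True)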
