Scrapy image download how to use custom filename

Anonymous (unverified), submitted 2019-12-03 08:44:33

Question:

For my scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with a SHA1 hash of their URLs as the file names.

How can I store the files using my own custom file names instead?

What if my custom file name needs to contain another scraped field from the same item? For example, I want to use item['desc'] as the file name for the image downloaded from item['image_url']. If I understand correctly, that would involve somehow accessing the other item fields from the image pipeline.

Any help will be appreciated.

Answer 1:

This is how I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory. The key argument is the file name of the downloaded image.

    import os

    # Scrapy imports (FSImagesStore, ImagesPipeline, settings, NotConfigured)
    # are omitted here; their module paths depend on the Scrapy version.

    class FSImagesStoreChangeableDirectory(FSImagesStore):

        def persist_image(self, key, image, buf, info, append_path):
            # key is the image file name; append_path is prepended to it
            absolute_path = self._get_filesystem_path(append_path + '/' + key)
            self._mkdir(os.path.dirname(absolute_path), info)
            image.save(absolute_path)

    class ProjectPipeline(ImagesPipeline):

        def __init__(self):
            # Skips ImagesPipeline.__init__ so the custom store can be set below
            super(ImagesPipeline, self).__init__()
            store_uri = settings.IMAGES_STORE
            if not store_uri:
                raise NotConfigured
            self.store = FSImagesStoreChangeableDirectory(store_uri)


Answer 2:

This is just an update of the answer for Scrapy 0.24 (edited), where image_key() is deprecated.

    class MyImagesPipeline(ImagesPipeline):

        # Name download version
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + response.url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

        def get_media_requests(self, item, info):
            # yield Request(item['images'])  # Adding meta. Dunno how to put it in one line :-)
            for image in item['images']:
                yield Request(image)
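
One caveat with request.url.split('/')[-1]: if the image URL carries a query string or fragment, that text ends up in the file name too. Below is a minimal sketch of a helper that strips it, assuming Python 3 and using only the standard library. The name basename_from_url is hypothetical and not part of Scrapy; the idea is that file_path() could call it instead of splitting the URL by hand.

```python
import posixpath
from urllib.parse import unquote, urlparse


def basename_from_url(url):
    """Return the last path segment of a URL, without query or fragment."""
    # urlparse() separates the path from '?query' and '#fragment'.
    path = urlparse(url).path
    # Decode percent-escapes such as %20, then keep only the final segment.
    return posixpath.basename(unquote(path))


print(basename_from_url("http://example.com/img/cat.jpg?size=large"))
```

Inside the pipeline this would be used as return 'full/%s' % basename_from_url(request.url).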


Answer 3:

In Scrapy 0.12 I solved it something like this:

    class MyImagesPipeline(ImagesPipeline):

        # Name download version
        def image_key(self, url):
            image_guid = url.split('/')[-1]
            return 'full/%s.jpg' % (image_guid)

        # Name thumbnail version
        def thumb_key(self, url, thumb_id):
            image_guid = thumb_id + url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

        def get_media_requests(self, item, info):
            yield Request(item['images'])


Answer 4:

I found a way in 2017, with Scrapy 1.1.3:

    # Both functions are methods of your ImagesPipeline subclass.

    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {'filename': item['name']}
        yield Request(url=img_url, meta=meta)

As in the code above, you can add the name you want to the Request meta in get_media_requests() and read it back in file_path() with request.meta.get('yourname', '').
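
A scraped field such as item['name'] may contain spaces, slashes, or other characters that are awkward in a file name, so it can help to clean the value before returning it from file_path(). The following is only a sketch under that assumption: safe_image_path, its folder and ext parameters, and the "image" fallback are all hypothetical names chosen for illustration, not part of Scrapy. The .jpg default reflects that ImagesPipeline converts downloaded images to JPEG.

```python
import re


def safe_image_path(name, folder="full", ext=".jpg"):
    """Build a relative storage path from a scraped item field."""
    # Collapse every run of characters other than word characters,
    # dots and dashes into a single underscore.
    cleaned = re.sub(r"[^\w.-]+", "_", name)
    # Drop leading/trailing dots and underscores; fall back to a
    # placeholder (an assumption) if nothing usable is left.
    cleaned = cleaned.strip("._") or "image"
    return "%s/%s%s" % (folder, cleaned, ext)


print(safe_image_path("Red / Blue shoes"))
```

In file_path() this would be called as safe_image_path(request.meta.get('filename', '')). Note that two items with the same name map to the same path, so one image would overwrite the other unless something unique (an ID, for instance) is added to the name.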



Answer 5:

I did a nasty quick hack for that. In my case, I stored the title of the image in my feed, and I had only one entry in image_urls per item, so I wrote the following script. It renames the image files in the /images/full/ directory to the corresponding title from the item feed, which I had stored as JSON.

    import os
    import json

    # os.path.join with separate parts works on Windows and elsewhere
    img_dir = os.path.join(os.getcwd(), 'images', 'full')
    item_dir = os.path.join(os.getcwd(), 'data.json')  # path of the JSON feed

    with open(item_dir, 'r') as item_json:
        items = json.load(item_json)

    for item in items:
        if len(item['images']) > 0:
            # Hashed file name written by the pipeline, e.g. '<sha1>.jpg'
            cur_file = item['images'][0]['path'].split('/')[-1]
            cur_format = cur_file.split('.')[-1]
            new_title = item['title'] + '.%s' % cur_format
            file_path = os.path.join(img_dir, cur_file)
            os.rename(file_path, os.path.join(img_dir, new_title))

It's nasty and not recommended, but it is a naive alternative approach.
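
If you do go the post-processing route, a slightly more defensive variant might look like the sketch below. It assumes the same feed layout as above (each item has a 'title' and an 'images' list whose entries carry a 'path'). The function name rename_images and the decision to skip a file instead of overwriting it are my own assumptions, not something Scrapy provides.

```python
import os


def rename_images(items, img_dir):
    """Rename each item's first downloaded image to '<title><ext>'.

    Returns the list of (old_name, new_name) pairs that were renamed.
    """
    renamed = []
    for item in items:
        images = item.get("images") or []
        if not images:
            continue
        old_name = os.path.basename(images[0]["path"])
        ext = os.path.splitext(old_name)[1]
        new_name = item["title"] + ext
        src = os.path.join(img_dir, old_name)
        dst = os.path.join(img_dir, new_name)
        # Skip missing source files and never overwrite an existing file.
        if not os.path.exists(src) or os.path.exists(dst):
            continue
        os.rename(src, dst)
        renamed.append((old_name, new_name))
    return renamed
```

It would be called with the list loaded from the feed, for example rename_images(items, img_dir), and running it a second time is harmless because files that were already renamed are skipped.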



Answer 6:

I rewrote the code, replacing "response." with "request." in the thumb_path method. Otherwise it won't work, because response is set to None.

    class MyImagesPipeline(ImagesPipeline):

        # Name download version
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + request.url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

        def get_media_requests(self, item, info):
            # yield Request(item['images'])  # Adding meta. Dunno how to put it in one line :-)
            for image in item['images']:
                yield Request(image)

