Question:
For my Scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with a SHA1 hash of their URLs as the file names.
How can I store the files using my own custom file names instead?
What if my custom file name needs to contain another scraped field from the same item, e.g. use item['desc'] in the file name for the image at item['image_url']? If I understand correctly, that would involve somehow accessing the other item fields from the image pipeline.
Any help will be appreciated.
Answer 1:
This is how I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory. The file name of the downloaded image is passed in as key.
    class FSImagesStoreChangeableDirectory(FSImagesStore):

        def persist_image(self, key, image, buf, info, append_path):
            absolute_path = self._get_filesystem_path(append_path + '/' + key)
            self._mkdir(os.path.dirname(absolute_path), info)
            image.save(absolute_path)


    class ProjectPipeline(ImagesPipeline):

        def __init__(self):
            super(ImagesPipeline, self).__init__()
            store_uri = settings.IMAGES_STORE
            if not store_uri:
                raise NotConfigured
            self.store = FSImagesStoreChangeableDirectory(store_uri)
Answer 2:
This is just an update of the answer for Scrapy 0.24 (EDITED), where image_key() is deprecated:
    class MyImagesPipeline(ImagesPipeline):

        # Name download version
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + response.url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

        def get_media_requests(self, item, info):
            # yield Request(item['images'])  # Adding meta. Dunno how to put it in one line :-)
            for image in item['images']:
                yield Request(image)
Answer 3:
In Scrapy 0.12 I solved it like this:
    class MyImagesPipeline(ImagesPipeline):

        # Name download version
        def image_key(self, url):
            image_guid = url.split('/')[-1]
            return 'full/%s.jpg' % (image_guid)

        # Name thumbnail version
        def thumb_key(self, url, thumb_id):
            image_guid = thumb_id + url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

        def get_media_requests(self, item, info):
            yield Request(item['images'])
Answer 4:
I found a way that works in 2017 with Scrapy 1.1.3:
    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {'filename': item['name']}
        yield Request(url=img_url, meta=meta)
As in the code above, you can add the name you want to the Request meta in get_media_requests(), and get it back in file_path() with request.meta.get('yourname', '').
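A scraped field such as item['name'] or item['desc'] may contain slashes, spaces, or other characters that are awkward in a file name, and the value stored in meta has no extension of its own. The following is a minimal sketch of a helper that could be called from file_path(). The function name build_image_filename and the default_ext parameter are hypothetical and not part of Scrapy; only the standard library is used.

```python
import os
import re
from urllib.parse import urlparse


def build_image_filename(name, url, default_ext=".jpg"):
    """Return a file-system-safe file name built from a scraped field.

    Hypothetical helper, not part of Scrapy.
    """
    # Take the extension from the URL path, ignoring any query string.
    ext = os.path.splitext(urlparse(url).path)[1].lower() or default_ext
    # Collapse every run of characters other than word characters,
    # dots and hyphens into a single underscore.
    stem = re.sub(r"[^\w.-]+", "_", name.strip()).strip("._")
    # Fall back to a generic stem if nothing usable is left.
    return (stem or "image") + ext
```

Under these assumptions, file_path() could return something like 'full/' + build_image_filename(request.meta.get('filename', ''), request.url). Note that two items with the same name would map to the same path, so a unique field may need to be included in the name as well.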
Answer 5:
I did a nasty quick hack for that. In my case, I stored the title of the image in my feed, and I had only one entry in image_urls per item, so I wrote the following script. It renames the image files in the /images/full/ directory to the corresponding title from the item feed, which I had stored as JSON.
    import os
    import json

    img_dir = os.path.join(os.getcwd(), 'images\\full')
    item_dir = os.path.join(os.getcwd(), 'data.json')

    with open(item_dir, 'r') as item_json:
        items = json.load(item_json)

    for item in items:
        if len(item['images']) > 0:
            cur_file = item['images'][0]['path'].split('/')[-1]
            cur_format = cur_file.split('.')[-1]
            new_title = item['title'] + '.%s' % cur_format
            file_path = os.path.join(img_dir, cur_file)
            os.rename(file_path, os.path.join(img_dir, new_title))
It's nasty and not recommended, but it is a naive alternative approach.
Answer 6:
I rewrote the code, changing "response." to "request." in the thumb_path method. Otherwise it won't work, because response is set to None.
    class MyImagesPipeline(ImagesPipeline):

        # Name download version
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + request.url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

        def get_media_requests(self, item, info):
            # yield Request(item['images'])  # Adding meta. Dunno how to put it in one line :-)
            for image in item['images']:
                yield Request(image)