python pickle - dumping a very huge list

Submitted by 北城以北 on 2019-12-07 03:22:27

The pickle file format isn't particularly efficient, especially not for images. Even if your pixels were stored as 1 byte per pixel, you would have

50,000 × 240 × 180 = 2,160,000,000

so roughly 2 GB of raw pixel data. Your pixels undoubtedly take more space than that; I'm not sure what PIL's tostring() method actually produces for an image. It's entirely plausible that your resulting file could run into the tens of gigabytes.
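
For a rough sense of scale, here is the same arithmetic in a few lines of Python (the 3-bytes-per-pixel figure is my own assumption for RGB data, not something from the question):

n_images, width, height = 50000, 240, 180
grayscale_bytes = n_images * width * height   # 1 byte per pixel
rgb_bytes = grayscale_bytes * 3               # 3 bytes per pixel for RGB
print(grayscale_bytes / 1e9)                  # 2.16  -> roughly 2 GB
print(rgb_bytes / 1e9)                        # 6.48  -> about 6.5 GB before pickle overhead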

You may want to consider a storage method other than pickle. For example, what would be wrong with simply storing the files on disk in their native image format, and pickling a list of the file names?
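
A minimal sketch of that idea -- the paths are hypothetical placeholders for your own files:

import pickle

# Keep the images in their native format; only the small list of paths gets pickled.
image_paths = ['images/img_00000.png', 'images/img_00001.png']  # ... up to 50,000 entries

with open('image_index.pkl', 'wb') as f:
    pickle.dump(image_paths, f)

# Later, load the index and open only the images you actually need.
with open('image_index.pkl', 'rb') as f:
    image_paths = pickle.load(f)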

I agree that you probably shouldn't be storing tons of pickled images to disk… unless you absolutely have to (for whatever reason). If you do, you should probably get a really big disk, plenty of memory, and lots of processing power.

Anyway, if you read your image data into a numpy array with scipy.ndimage.imread, you can use numpy's internal format plus compression to store each image to disk.
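
If you wanted to do that by hand, numpy's compressed .npz format is one option; a rough sketch (scipy.ndimage.imread is what this answer uses, though recent SciPy versions have dropped it in favour of imageio):

import numpy as np
from scipy import ndimage

img = ndimage.imread('image1')               # image as a numpy array
np.savez_compressed('image1.npz', img=img)   # numpy internal format + compression

img_back = np.load('image1.npz')['img']      # read it back when needed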

There are packages like klepto that make this easy for you.

>>> from klepto.archives import dir_archive
>>> from scipy import ndimage
>>> demo = dir_archive('demo', {}, serialized=True, compression=9, cached=False)
>>> demo['image1'] = ndimage.imread('image1')
>>> demo['image2'] = ndimage.imread('image2')

Now you have a dictionary interface to compressed, pickled numpy arrays, with one image per file in a directory called demo (maybe you need to add the fast=True flag, I don't remember). All the dictionary methods are pretty much available, so you can access the images as you need them for your analysis, then throw the pickled images away with del demo['image1'] or something similar.
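
Assuming dir_archive behaves like an ordinary dict, as described above, reading and discarding images looks roughly like this (analyse is just a placeholder for your own code):

img1 = demo['image1']            # load and decompress one pickled array from demo/
for key in demo:                 # iterate over whatever is currently archived
    result = analyse(demo[key])  # placeholder for your analysis
del demo['image1']               # throw the pickled image away when you're done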

You can also use klepto to provide custom encodings, so you can have fairly cryptographic storage of your data. You can even choose not to encrypt/pickle your data at all, and instead just have a dictionary interface to your files on disk -- that's often handy in itself.

If you don't turn caching off, you might hit the limits of your computer's memory or disk size unless you are careful about the order in which you dump and load the images to disk. In the above example, I have caching to memory turned off, so it writes directly to disk. There are other options as well, such as using memory-mapped mode or writing to HDF files. I typically use a scheme like the above for large array data to be processed on a single machine, and might pick a MySQL archive backend for smaller data to be accessed by several machines in parallel.
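
For the cached variant, my recollection is that klepto's archives expose dump() and load() to sync the in-memory cache with the files on disk; roughly like the following, though the method names are from memory, so check the klepto docs:

cached_demo = dir_archive('demo_cached', {}, serialized=True, compression=9, cached=True)
cached_demo['image1'] = ndimage.imread('image1')   # sits in memory until flushed
cached_demo.dump()   # write the cached entries out to the demo_cached/ directory
cached_demo.load()   # pull archived entries back into the in-memory cache later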

Get klepto here: https://github.com/uqfoundation
