Lossy compression of numpy array (image, uint8) in memory

Submitted by 泄露秘密 on 2019-12-24 21:53:17

Question


I am trying to load a data set of 1,000,000 images into memory. As standard numpy arrays (uint8), all images combined fill around 100 GB of RAM, but I need to get this down to < 50 GB while still being able to quickly read the images back into numpy (that's the whole point of keeping everything in memory). Lossless compression like blosc only reduces file size by around 10%, so I went to JPEG compression. Minimal example:

import io
import numpy as np
from PIL import Image

numpy_array = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
image = Image.fromarray(numpy_array)
output = io.BytesIO()
image.save(output, format='JPEG')

At runtime I am reading the images with:

[np.array(Image.open(output)) for _ in range(1000)]

JPEG compression is very effective (< 10 GB), but the time it takes to read 1,000 images back into numpy arrays is around 2.3 seconds, which seriously hurts the performance of my experiments. I am looking for suggestions that give a better trade-off between compression and read speed.
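For context, a minimal sketch of this setup with one BytesIO buffer per image (the buffers list and decode helper below are illustrative names, not from the question); each buffer is rewound before decoding:

import io
import numpy as np
from PIL import Image

# Compress each image into its own in-memory JPEG buffer
buffers = []
for _ in range(1000):
    arr = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format='JPEG')
    buffers.append(buf)

# Decode every buffer back into a numpy array at runtime
def decode(buf):
    buf.seek(0)  # rewind before reading
    return np.array(Image.open(buf))

images = [decode(buf) for buf in buffers]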


Answer 1:


I am still not certain I understand what you are trying to do, but I created some dummy images and did some tests as follows. I'll show how I did that in case other folks feel like trying other methods and want a data set.

First, I created 1,000 images using GNU Parallel and ImageMagick like this:

parallel convert -depth 8 -size 256x256 xc:red +noise random -fill white -gravity center -pointsize 72 -annotate 0 "{}" -alpha off s_{}.png ::: {0..999}

That gives me 1,000 images called s_0.png through s_999.png; each is a noisy red 256x256 image with its index drawn in white in the centre.
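If ImageMagick is not available, a roughly comparable dummy data set can be generated in Python; this is only a sketch, and the noise and text rendering will not match the convert command exactly:

import numpy as np
from PIL import Image, ImageDraw

# Generate 1,000 noisy 256x256 RGB test images, each annotated with its index
for i in range(1000):
    noise = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    im = Image.fromarray(noise)
    ImageDraw.Draw(im).text((110, 120), str(i), fill='white')
    im.save("s_{}.png".format(i))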

Then I did what I think you are trying to do - though it is hard to tell from your code:

#!/usr/local/bin/python3

import io
import time
import numpy as np
from PIL import Image

# Create BytesIO object to hold the compressed images
output = io.BytesIO()

# Load all 1,000 images and write them into the BytesIO object as JPEGs
for i in range(1000):
    name = "s_{}.png".format(i)
    print("Opening image: {}".format(name))
    im = Image.open(name)
    im.save(output, format='JPEG', quality=50)
    nbytes = output.getbuffer().nbytes
    print("BytesIO size: {}".format(nbytes))

# Read the images back from the BytesIO object into a list of numpy arrays
output.seek(0)               # rewind to the start of the concatenated JPEG stream
start = time.perf_counter()  # time.clock() was removed in Python 3.8
l = [np.array(Image.open(output)) for _ in range(1000)]
diff = time.perf_counter() - start
print("Time: {}".format(diff))

And that takes 2.4 seconds to read all 1,000 images from the BytesIO object and turn them into numpy arrays.

Then I palettised the images by reducing them to 256 colours (which I agree is lossy, just like your method) and saved a list of palettised image objects which I can later convert back to numpy arrays simply by calling:

np.array(ImageList[i].convert('RGB'))

Storing the data as a palettised image saves 66% of the space because you only store one byte of palette index per pixel rather than 3 bytes of RGB, so it is better than the 50% compression you seek.
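A quick back-of-the-envelope check of that figure for one 256x256 RGB image (the numbers follow directly from the sizes involved):

rgb_bytes = 256 * 256 * 3        # 196,608 bytes of raw RGB
pal_bytes = 256 * 256 + 256 * 3  # 66,304 bytes: 1 index byte per pixel plus a 256-entry RGB palette
print(1 - pal_bytes / rgb_bytes) # roughly 0.66, i.e. about 66% saved

The palettised version of the test script: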

#!/usr/local/bin/python3

import time
import numpy as np
from PIL import Image

# Empty list of palettised images
ImageList = []

# Load all 1,000 images and palettise each one
for i in range(1000):
    name = "s_{}.png".format(i)
    print("Opening image: {}".format(name))
    im = Image.open(name)
    # Add palettised (256-colour) image to the list
    ImageList.append(im.quantize(colors=256, method=2))

# Read the images back into numpy arrays
start = time.perf_counter()  # time.clock() was removed in Python 3.8
l = [np.array(ImageList[i].convert('RGB')) for i in range(1000)]
diff = time.perf_counter() - start
print("Time: {}".format(diff))

# Quick test
# Image.fromarray(l[999]).save("result.png")

That now takes 0.2s instead of 2.4s - let's hope the loss of colour accuracy is acceptable to your unstated application :-)
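If you want to quantify that colour loss, one possible check (purely illustrative, reusing the ImageList built above) is to compare the reconstructed arrays against the original PNGs:

import numpy as np
from PIL import Image

# Mean absolute per-channel error introduced by 256-colour quantisation
total = 0.0
for i in range(1000):
    original = np.array(Image.open("s_{}.png".format(i)).convert('RGB'), dtype=np.int16)
    quantised = np.array(ImageList[i].convert('RGB'), dtype=np.int16)
    total += np.abs(original - quantised).mean()
print("Mean absolute error per channel: {}".format(total / 1000))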



来源:https://stackoverflow.com/questions/51667881/lossy-compression-of-numpy-array-image-uint8-in-memory
