Question
I have a Python app that imports 200k+ images, crops them, and presents each cropped image to pyzbar to interpret a barcode. Cropping helps because there are multiple barcodes on the image and, presumably, pyzbar is a little faster when given smaller images.
Currently I am using Pillow to import and crop the image.
On average, importing and cropping an image takes 262 ms and pyzbar takes 8 ms.
A typical run is about 21 hours.
I wonder if a library other than Pillow might offer substantial improvements in loading/cropping. Ideally the library should be available for macOS, but I could also run the whole thing in a virtual Ubuntu machine.
I am working on a version that can run in parallel processes, which will be a big improvement, but if I could get a 25% or greater speed increase from a different library I would add that as well.
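For reference, the parallel version I have in mind looks roughly like the sketch below (a minimal sketch, not my production code; the crop box and the use of ProcessPoolExecutor are illustrative):

#!/usr/bin/env python3
import sys
from concurrent.futures import ProcessPoolExecutor

from PIL import Image
from pyzbar import pyzbar

# Illustrative crop box (left, upper, right, lower); the real app crops
# several regions per image, one per expected barcode.
CROP_BOX = (0, 150, 270, 1050)

def process_one(filename):
    # Load and crop with Pillow, then hand the region to pyzbar
    roi = Image.open(filename).crop(CROP_BOX)
    return filename, pyzbar.decode(roi)

if __name__ == '__main__':
    # One worker process per core; each image is independent work
    with ProcessPoolExecutor() as pool:
        for filename, barcodes in pool.map(process_one, sys.argv[1:], chunksize=16):
            print(filename, [b.data for b in barcodes])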
Answer 1:
As you didn't provide a sample image, I made a dummy image with dimensions 2544x4200 and 1.1 MB in size. I made 1,000 copies of that image and processed all 1,000 images for each benchmark.
As you only gave your code in the comments area, I took it, formatted it, and made the best I could of it. I also put it in a loop so that many files can be processed with a single invocation of the Python interpreter - this becomes important when you have 20,000 files.
That looks like this:
#!/usr/bin/env python3
import sys
from PIL import Image

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
    print(f'Processing: {filename}')
    imgc = Image.open(filename).crop((0, 150, 270, 1050))
My suspicion is that I can make that faster using:
- GNU Parallel, and/or
- pyvips
Here is a pyvips version of your code:
#!/usr/bin/env python3
import sys
import pyvips
import numpy as np

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
    print(f'Processing: {filename}')
    img = pyvips.Image.new_from_file(filename, access='sequential')
    roi = img.crop(0, 150, 270, 900)
    mem_img = roi.write_to_memory()
    # Make a numpy array from that buffer object
    nparr = np.ndarray(buffer=mem_img, dtype=np.uint8,
                       shape=[roi.height, roi.width, roi.bands])
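That numpy array can then be handed straight to pyzbar. A minimal sketch of the hand-off, continuing inside the loop above (my addition, not part of the benchmarked code; nparr is the array built above):

from pyzbar import pyzbar

# ... build `nparr` as in the loop above, then decode from a single
# channel - a 2D uint8 array is valid pyzbar input, and one channel
# is enough for monochrome barcodes
for barcode in pyzbar.decode(nparr[:, :, 0]):
    print(barcode.data)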
Here are the results:
Sequential original code
./orig.py bc*jpg
224 seconds, i.e. 224 ms per image, same as you
Parallel original code
parallel ./orig.py ::: bc*jpg
55 seconds
Parallel original code but passing as many filenames as possible
parallel -X ./orig.py ::: bc*jpg
42 seconds
Sequential pyvips
./vipsversion bc*
30 seconds, i.e. 7x as fast as PIL which was 224 seconds
Parallel pyvips
parallel ./vipsversion ::: bc*
32 seconds
Parallel pyvips but passing as many filenames as possible
parallel -X ./vipsversion ::: bc*
5.2 seconds, i.e. this is the way to go :-)
Note that you can install GNU Parallel on macOS with homebrew:
brew install parallel
Answer 2:
You might take a look at PyTurboJPEG, a Python wrapper for libjpeg-turbo with insanely fast rescaling (1/2, 1/4, 1/8) while decoding large JPEG images; the returned numpy.ndarray is handy for image cropping. Moreover, its JPEG encoding speed is also remarkable.
from turbojpeg import TurboJPEG

# specifying library path explicitly
# jpeg = TurboJPEG(r'D:\turbojpeg.dll')
# jpeg = TurboJPEG('/usr/lib64/libturbojpeg.so')
# jpeg = TurboJPEG('/usr/local/lib/libturbojpeg.dylib')

# using default library installation
jpeg = TurboJPEG()

# direct rescaling 1/2 while decoding input.jpg to a BGR array
in_file = open('input.jpg', 'rb')
bgr_array_half = jpeg.decode(in_file.read(), scaling_factor=(1, 2))
in_file.close()

# encoding the BGR array to output.jpg with default settings
out_file = open('output.jpg', 'wb')
out_file.write(jpeg.encode(bgr_array_half))
out_file.close()
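Since the decoded frame is a plain numpy array, cropping is just a slice, and the slice can go straight to pyzbar. A hedged sketch (the crop coordinates are the Pillow box from the question, applied as rows then columns; this hand-off is my addition, not part of PyTurboJPEG's docs):

from pyzbar import pyzbar
from turbojpeg import TurboJPEG

jpeg = TurboJPEG()

with open('input.jpg', 'rb') as f:
    bgr_array = jpeg.decode(f.read())

# numpy slicing: rows 150:1050, columns 0:270 match Pillow's
# crop((0, 150, 270, 1050)) box of (left, upper, right, lower)
roi = bgr_array[150:1050, 0:270]

# a single uint8 channel is enough for monochrome barcodes
print(pyzbar.decode(roi[:, :, 0]))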
Prebuilt libjpeg-turbo binaries for macOS and Linux are also available.
Source: https://stackoverflow.com/questions/54975193/fast-way-to-import-and-crop-a-jpeg-in-python-lib