Question
I have a Python app that imports 200k+ images, crops them, and presents each cropped image to pyzbar to interpret a barcode. Cropping helps because there are multiple barcodes on the image and, presumably, pyzbar is a little faster when given smaller images.
Currently I am using Pillow to import and crop the image.
On average, importing and cropping an image takes 262 ms and pyzbar takes 8 ms.
A typical run is about 21 hours.
I wonder if a library other than Pillow might offer substantial improvements in loading/cropping. Ideally the library should be available for macOS, but I could also run the whole thing in a virtual Ubuntu machine.
I am working on a version that can run in parallel processes, which will be a big improvement, but if I could get a 25% or greater speed increase from a different library I would add that as well.
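For reference, the parallel version I have in mind looks roughly like the sketch below (a minimal sketch, not my production code; the crop box and the use of ProcessPoolExecutor are illustrative):

#!/usr/bin/env python3
import sys
from concurrent.futures import ProcessPoolExecutor

from PIL import Image
from pyzbar import pyzbar

# Illustrative crop box (left, upper, right, lower); the real app crops
# several regions per image, one per expected barcode.
CROP_BOX = (0, 150, 270, 1050)

def process_one(filename):
    # Load and crop with Pillow, then hand the region to pyzbar
    roi = Image.open(filename).crop(CROP_BOX)
    return filename, pyzbar.decode(roi)

if __name__ == '__main__':
    # One worker process per core; each image is independent work
    with ProcessPoolExecutor() as pool:
        for filename, barcodes in pool.map(process_one, sys.argv[1:], chunksize=16):
            print(filename, [b.data for b in barcodes])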
Answer 1:
As you didn't provide a sample image, I made a dummy image with dimensions 2544x4200 and 1.1 MB in size. I made 1,000 copies of that image and processed all 1,000 images for each benchmark.
As you only gave your code in the comments area, I took it, formatted it, and made the best I could of it. I also put it in a loop so that many files can be processed with a single invocation of the Python interpreter - this becomes important when you have 20,000 files.
That looks like this:
#!/usr/bin/env python3
import sys
from PIL import Image

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
    print(f'Processing: {filename}')
    imgc = Image.open(filename).crop((0, 150, 270, 1050))
My suspicion is that I can make that faster using:
- GNU Parallel, and/or
- pyvips
Here is a pyvips version of your code:
#!/usr/bin/env python3
import sys
import pyvips
import numpy as np

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
    print(f'Processing: {filename}')
    img = pyvips.Image.new_from_file(filename, access='sequential')
    roi = img.crop(0, 150, 270, 900)
    mem_img = roi.write_to_memory()
    # Make a numpy array from that buffer object
    nparr = np.ndarray(buffer=mem_img, dtype=np.uint8,
                       shape=[roi.height, roi.width, roi.bands])
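That numpy array can then be handed straight to pyzbar. A minimal sketch of the hand-off, continuing inside the loop above (my addition, not part of the benchmarked code; nparr is the array built above):

from pyzbar import pyzbar

# ... build `nparr` as in the loop above, then decode from a single
# channel - a 2D uint8 array is valid pyzbar input, and one channel
# is enough for monochrome barcodes
for barcode in pyzbar.decode(nparr[:, :, 0]):
    print(barcode.data)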
Here are the results:
Sequential original code
./orig.py bc*jpg
224 seconds, i.e. 224 ms per image, same as you
Parallel original code
parallel ./orig.py ::: bc*jpg
55 seconds
Parallel original code but passing as many filenames as possible
parallel -X ./orig.py ::: bc*jpg
42 seconds
Sequential pyvips
./vipsversion bc*
30 seconds, i.e. 7x as fast as PIL which was 224 seconds
Parallel pyvips
parallel ./vipsversion ::: bc*
32 seconds
Parallel pyvips but passing as many filenames as possible
parallel -X ./vipsversion ::: bc*
5.2 seconds, i.e. this is the way to go :-)
Note that you can install GNU Parallel on macOS with homebrew:
brew install parallel
Answer 2:
You might take a look at PyTurboJPEG, a Python wrapper for libjpeg-turbo with insanely fast rescaling (1/2, 1/4, 1/8) while decoding large JPEG images; the returned numpy.ndarray is handy for image cropping. Moreover, its JPEG encoding speed is also remarkable.
from turbojpeg import TurboJPEG

# specifying library path explicitly
# jpeg = TurboJPEG(r'D:\turbojpeg.dll')
# jpeg = TurboJPEG('/usr/lib64/libturbojpeg.so')
# jpeg = TurboJPEG('/usr/local/lib/libturbojpeg.dylib')

# using default library installation
jpeg = TurboJPEG()

# direct rescaling 1/2 while decoding input.jpg to a BGR array
in_file = open('input.jpg', 'rb')
bgr_array_half = jpeg.decode(in_file.read(), scaling_factor=(1, 2))
in_file.close()

# encoding the BGR array to output.jpg with default settings
out_file = open('output.jpg', 'wb')
out_file.write(jpeg.encode(bgr_array_half))
out_file.close()
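Since the decoded frame is a plain numpy array, cropping is just a slice, and the slice can go straight to pyzbar. A hedged sketch (the crop coordinates are the Pillow box from the question, applied as rows then columns; this hand-off is my addition, not part of PyTurboJPEG's docs):

from pyzbar import pyzbar
from turbojpeg import TurboJPEG

jpeg = TurboJPEG()

with open('input.jpg', 'rb') as f:
    bgr_array = jpeg.decode(f.read())

# numpy slicing: rows 150:1050, columns 0:270 match Pillow's
# crop((0, 150, 270, 1050)) box of (left, upper, right, lower)
roi = bgr_array[150:1050, 0:270]

# a single uint8 channel is enough for monochrome barcodes
print(pyzbar.decode(roi[:, :, 0]))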
Prebuilt libjpeg-turbo binaries for macOS and Linux are also available.
Source: https://stackoverflow.com/questions/54975193/fast-way-to-import-and-crop-a-jpeg-in-python-lib