Extract images from PDF: how to handle JBIG2-encoded images

Question

I have a bunch of PDF files. Some of them are pure text, but others are fully or partially saved as "one image per page" because they were generated by a scanner.

I need to extract all images contained in the PDF and then examine each image separately.

I was able to extract most of the images with a Python script I found here on SO; see the question:

Extract images from PDF without resampling, in python?

Some of the embedded images were encoded with JBIG2, and I could not find any Python library or other tool to convert JBIG2 into something that can be opened easily with a generic graphics tool.

Answer 1:

Well, I struggled with this for many weeks. Many answers on SO helped me along, but there was always something missing; apparently few people here have had problems with JBIG2-encoded images.

In the batch of PDFs I have to process, JBIG2-encoded images are very common.

As far as I understand, many copier/scanner machines scan paper and produce PDF files full of JBIG2-encoded images.

So after many days of testing I decided to go with the answer proposed here by dkagedal a long time ago.

Here is my step-by-step procedure on Linux. (If you are on another OS, I suggest using a Linux Docker container; it is going to be much easier.)

First step:

apt-get install poppler-utils

Then I was able to run the command-line tool called pdfimages like this:

pdfimages -all myfile.pdf ./images_found/

The above command extracts all the images contained in myfile.pdf and saves them inside images_found (you have to create the images_found directory beforehand).

In the output you may find several types of images (depending on your PDF), such as png, jpg and tiff; all of these are easily readable with any graphics tool.
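If you want to drive this step from Python, here is a minimal sketch that wraps the same pdfimages call with subprocess. The function names build_pdfimages_cmd and extract_images are my own (hypothetical), and it assumes pdfimages from poppler-utils is on your PATH. It also creates the output directory for you.

```python
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path, out_dir):
    """Return the pdfimages command line that dumps every image in its native format."""
    # The trailing "/" makes out_dir the image-root prefix, so files land
    # inside the directory (e.g. out_dir/-145.jb2e), as in the command above.
    return ["pdfimages", "-all", str(pdf_path), f"{Path(out_dir)}/"]


def extract_images(pdf_path, out_dir):
    """Create out_dir if needed, run pdfimages, and return the files found there."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfimages exits with an error.
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir), check=True)
    return sorted(Path(out_dir).iterdir())


# Example call (needs a real PDF and pdfimages installed):
# files = extract_images("myfile.pdf", "./images_found/")
```

Keeping the command builder separate from the function that runs it makes it easy to print or log the exact command before executing it.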

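You will also get some files with names like -145.jb2e and -145.jb2g.

These two files together contain ONE IMAGE encoded in JBIG2, split across two files: the .jb2g file holds the global data (the header-like part shared by the stream) and the .jb2e file holds the embedded image stream itself.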
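To find which files belong together, you can match them by name. A small sketch, assuming the naming shown above (same stem, .jb2e and .jb2g extensions); pair_jbig2_files is a hypothetical helper name. It simply skips any .jb2e file that has no matching .jb2g, so those would need separate handling.

```python
from pathlib import Path


def pair_jbig2_files(image_dir):
    """Return (jb2g_file, jb2e_file) tuples for files that share the same stem."""
    pairs = []
    for embedded in sorted(Path(image_dir).glob("*.jb2e")):
        # -145.jb2e -> -145.jb2g
        globals_file = embedded.with_suffix(".jb2g")
        if globals_file.exists():
            pairs.append((globals_file, embedded))
    return pairs
```

The tuple order (.jb2g first, .jb2e second) matches the argument order of the jbig2dec command used in this answer.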

Again, I lost many days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.

So first you need to install this magic tool:

apt-get install jbig2dec

Then you can run:

jbig2dec -t png -145.jb2g -145.jb2e

Repeat this for every .jb2g/.jb2e pair and you will finally have all the extracted images converted into something useful.

Good luck!
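The conversion can be scripted from Python as well. This is only a sketch under a few assumptions: jbig2dec is on your PATH, you want the PNG written next to the input files, and the helper names (build_jbig2dec_cmd, convert_pair) are hypothetical. It uses the same -t png option as the command above plus -o to name the output file. Because the extracted file names can start with a dash, bare names get a "./" prefix so they are not mistaken for options.

```python
import subprocess
from pathlib import Path


def _safe_arg(path):
    """Prefix "./" when a path starts with "-" so it is not parsed as an option."""
    text = str(path)
    return "./" + text if text.startswith("-") else text


def build_jbig2dec_cmd(globals_file, embedded_file):
    """Return (command, output_png) for decoding one .jb2g/.jb2e pair to PNG."""
    out_png = _safe_arg(Path(embedded_file).with_suffix(".png"))
    cmd = [
        "jbig2dec", "-t", "png", "-o", out_png,
        _safe_arg(globals_file),   # global data comes first
        _safe_arg(embedded_file),  # embedded image stream comes second
    ]
    return cmd, out_png


def convert_pair(globals_file, embedded_file):
    """Run jbig2dec on one pair and return the path of the PNG it was asked to write."""
    cmd, out_png = build_jbig2dec_cmd(globals_file, embedded_file)
    subprocess.run(cmd, check=True)
    return Path(out_png)


# Example call (needs jbig2dec installed and real extracted files):
# convert_pair("images_found/-145.jb2g", "images_found/-145.jb2e")
```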

Source: https://stackoverflow.com/questions/60851124/extract-images-from-pdf-how-to-handle-jbig2-encoded

Tags

python

pdf

jbig2