What are the best parameters for running ImageMagick to convert low-quality PDFs to images (for OCR)?

Submitted by 别等时光非礼了梦想 on 2019-12-04 03:18:01

You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing

convert -list delegate

(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:

convert -list delegate | findstr /i png

Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:

convert -list delegate | grep -i png

You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:

convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF

Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.

About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:

  1. By default, if you don't give an extra parameter, Ghostscript will output images with a 72 dpi resolution. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
  2. The detour of IM calling Ghostscript twice, to convert first PDF => PS and then PS => PNG, is a real blunder, because you never gain and hardly ever keep quality in the first step, but very often lose some. Reasons:
    • PDF can handle transparencies, which PostScript cannot.
    • PDF can embed TrueType fonts, which PostScript cannot, etc. (Conversion in the direction PS => PDF is not that critical.)
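
To put the resolution point in numbers, here is a minimal sketch of how the pixel size of a rendered page grows with the dpi setting. It assumes an A4 page of 595 x 842 points (1 point = 1/72 inch); page_pixels is a hypothetical helper, not part of Ghostscript or ImageMagick, and because shell integer arithmetic rounds down, a real renderer may differ by a pixel.

```shell
# Approximate pixel size of a page rendered at a given resolution.
# PDF page sizes are measured in points (1 pt = 1/72 inch).
# page_pixels is an illustrative helper name, not a real tool.
page_pixels() {
    width_pt=$1
    height_pt=$2
    dpi=$3
    # Integer arithmetic: results are rounded down.
    echo "$(( width_pt * dpi / 72 ))x$(( height_pt * dpi / 72 ))"
}

page_pixels 595 842 72     # A4 at the 72 dpi default: prints 595x842
page_pixels 595 842 600    # A4 at 600 dpi: prints 4958x7016
```

At 72 dpi a typical 10 pt character is only about 10 pixels tall, which is likely too coarse for OCR; at 600 dpi the same character gets roughly 83 pixels.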

That's why I'd suggest you convert your PDFs to PNG (or JPEG) in one go, using Ghostscript directly. And use the most recent version of Ghostscript (8.71 at the time of writing, with 9.01 soon to be released)! Here are example commands:

gswin32c.exe ^
  -sDEVICE=pngalpha ^
  -o output/page_%03d.png ^
  -r600 ^
  d:/path/to/your/input.pdf

(This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an existing output subdirectory, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try

gs \
  -sDEVICE=jpeg \
  -o output/page_%03d.jpeg \
  -r600 \
  -dJPEGQ=95 \
  /path/to/your/input.pdf

(Linux command version). This direct conversion avoids the intermediate PostScript format, which may have lost the TrueType font and transparency information that was in the original PDF file.
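
Since the command expects the output subdirectory to exist already, a small wrapper can create it first. The following is only a sketch: pdf2png, the DRY_RUN variable, and the defaults (an output directory named output, 600 dpi) are illustrative choices, and it assumes gs is on your PATH.

```shell
# Sketch: render every page of a PDF to PNG with Ghostscript in one pass.
# pdf2png and DRY_RUN are hypothetical names chosen for this example.
# Usage: pdf2png input.pdf [output-dir] [dpi]
pdf2png() {
    pdf=$1
    outdir=${2:-output}
    dpi=${3:-600}
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        printf '%s\n' "gs -sDEVICE=pngalpha -o $outdir/page_%03d.png -r$dpi $pdf"
        return 0
    fi
    # Create the output directory so Ghostscript has somewhere to write.
    mkdir -p "$outdir" || return 1
    gs -sDEVICE=pngalpha -o "$outdir/page_%03d.png" "-r$dpi" "$pdf"
}
```

For example, pdf2png /path/to/your/input.pdf scans 300 would write scans/page_001.png, scans/page_002.png and so on at 300 dpi. Setting DRY_RUN=1 beforehand only prints the Ghostscript command, which is handy for checking the parameters.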


[*] D'oh! I missed your "linux" tag at first...

-density 600 or so should give you what you need.
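
As a rough sketch of what that looks like in a full command: im_pdf2png and the RUN variable below are hypothetical names, and the output pattern page_%03d.png is just one possible choice. On ImageMagick 7 the command is usually magick rather than convert.

```shell
# Sketch: rasterize a PDF through ImageMagick at a chosen density.
# im_pdf2png and RUN are illustrative names, not part of ImageMagick.
# Set RUN=echo to print the command instead of executing it.
# Usage: im_pdf2png input.pdf [dpi]
im_pdf2png() {
    pdf=$1
    dpi=${2:-600}
    # -density comes before the input file so that it applies
    # when the PDF is read, not to the already rasterized pages.
    ${RUN:-} convert -density "$dpi" "$pdf" "page_%03d.png"
}
```

Called as im_pdf2png input.pdf, this writes one PNG per page into the current directory. If -density is placed after the input file instead, the PDF is still read at the 72 dpi default.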

At least two other tools you may want to consider:

  • pdfimages, which comes with the package poppler-utils, makes it easy to extract the images from a PDF without degrading them.
  • pdfsandwich, which can give you an OCR'd file by simply running pdfsandwich inputfile.pdf. You may need to tweak the options to get a decent result. See the official page for more info.
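
Before choosing between these options, it can help to see which of the tools mentioned on this page are actually installed. A minimal sketch, where have is a hypothetical helper name:

```shell
# Report which of the tools discussed here are available on the PATH.
# have is an illustrative helper, not a standard command.
have() {
    command -v "$1" >/dev/null 2>&1
}

for tool in gs convert pdfimages pdfsandwich; do
    if have "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```

If pdfimages is present, pdfimages -list input.pdf shows the embedded images and their resolution, and pdfimages -png input.pdf output/img extracts them as PNG files (output/img is an assumed path prefix, and the output directory must exist).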