Batch OCR of 5800+ PDFs written in German Fraktur

Submitted by 我的梦境 on 2019-12-09 02:19:29

First, I would recommend you install Homebrew if you have not already. It is an excellent package manager for the Mac.

Then I would recommend you install the Poppler package to get the pdfimages tool:

brew install poppler

You can then extract images from a PDF like this:

pdfimages SomeFile.pdf root

and you will get files named root-000.ppm, root-001.ppm and so on (monochrome images come out as .pbm instead), which will work fine with tesseract. You can add -png if you would rather have PNG images. I would avoid JPEG because its lossy compression can degrade the scanned text.
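To check that tesseract is happy with the extracted images, a minimal sketch of a sequential OCR loop might look like the following. The function name ocr_extracted is made up for illustration, and the language code "frk" is only an assumed placeholder: run tesseract --list-langs to see which Fraktur model, if any, is installed on your machine.

```shell
# Sketch: OCR every image that pdfimages wrote with a given prefix.
# Assumption: OCR_LANG names a Fraktur model present in your tessdata
# directory; "frk" is a placeholder default, not a guarantee.
OCR_LANG="${OCR_LANG:-frk}"

ocr_extracted() {
    prefix="$1"    # e.g. "root" for root-000.ppm, root-001.ppm, ...
    # pdfimages writes .pbm for monochrome images, .ppm otherwise, .png with -png
    for img in "$prefix"-*.pbm "$prefix"-*.ppm "$prefix"-*.png; do
        [ -e "$img" ] || continue    # skip patterns that matched nothing
        # tesseract adds the .txt suffix to the output base name itself
        tesseract "$img" "${img%.*}" -l "$OCR_LANG"
    done
}
```

Calling ocr_extracted root after the pdfimages command above should leave root-000.txt, root-001.txt and so on next to the images.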

If you can get that working, I would then suggest you install GNU Parallel with:

brew install parallel

and we can work on doing OCR in parallel down the line.


PLEASE TRY THE FOLLOWING ONLY IN A SMALL DIRECTORY WITH 5-6 COPIES OF YOUR ORIGINALS
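One way to set up such a scratch directory is sketched below. The function name make_sandbox, the directory name ocr-test and the default count of 6 are all arbitrary choices for illustration; the originals are only read, never modified.

```shell
# Sketch: copy the first few PDFs into a scratch directory for experiments.
# Assumption: "ocr-test" and the count of 6 are arbitrary defaults.
make_sandbox() {
    src="${1:-.}"            # directory holding the original PDFs
    dest="${2:-ocr-test}"    # scratch directory to create
    max="${3:-6}"            # how many PDFs to copy
    mkdir -p "$dest" || return 1
    n=0
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue        # no PDFs matched at all
        [ "$n" -ge "$max" ] && break     # stop once we have enough copies
        cp -p "$pdf" "$dest"/ || return 1
        n=$((n + 1))
    done
    echo "copied $n PDF(s) into $dest"
}
```

Run make_sandbox, then cd ocr-test, and try the commands there before touching the full collection.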

We can also extract the images in parallel using GNU Parallel like this:

parallel 'mkdir {.} && pdfimages {} {.}/{.}' ::: *pdf
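In case the placeholders are unfamiliar: GNU Parallel replaces {} with each input file name and {.} with the same name minus its extension. A plain-loop sketch of what the one-liner does, one PDF at a time, is shown below; the function name extract_all is just for illustration.

```shell
# Sequential equivalent of the GNU Parallel one-liner, for illustration only.
extract_all() {
    for pdf in *.pdf; do
        [ -e "$pdf" ] || continue         # no PDFs in this directory
        base="${pdf%.*}"                  # what {.} expands to: name without extension
        mkdir "$base" || continue         # as in the original, skip the PDF if mkdir fails
        pdfimages "$pdf" "$base/$base"    # images land in base/base-000.ppm, ...
    done
}
```

So SomeFile.pdf ends up with its own SomeFile directory holding SomeFile-000.ppm and so on. Because plain mkdir fails when the directory already exists, a second run skips PDFs that were already extracted.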

As for running Fred's textcleaner under GNU Parallel and overwriting the JPEGs in place, I think you will want something like this:

find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}
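
Once the images are cleaned, the parallel OCR step could be sketched along the same lines. This is an assumption about how you might wire it up, not a tested recipe for your collection: the function name ocr_tree_parallel is made up, and "frk" is a placeholder for whichever Fraktur model tesseract --list-langs reports on your system.

```shell
# Sketch: run tesseract over every matching image in a directory tree,
# letting GNU Parallel spread the jobs across the CPU cores.
# Assumption: OCR_LANG names an installed Fraktur model; "frk" is a placeholder.
OCR_LANG="${OCR_LANG:-frk}"

ocr_tree_parallel() {
    top="${1:-.}"      # directory tree to scan
    ext="${2:-jpg}"    # image extension to OCR, without the dot
    # -print0 and -0 keep file names containing spaces or newlines intact
    find "$top" -type f -name "*.$ext" -print0 |
        parallel -0 tesseract {} {.} -l "$OCR_LANG"
}
```

For example, ocr_tree_parallel . jpg would write a .txt file beside each JPEG; pass ppm or png as the second argument if you kept the lossless images.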