Batch OCRing PDFs that haven't already been OCR'd

滥情空心 2021-01-14 16:04

If I have 10,000 PDFs, some of which have been OCRed and some of which have only 1 page OCRed while the rest of the pages have not, how can I go through all the PDFs and

4 answers
  •  死守一世寂寞
    2021-01-14 16:38

    Unburying this thread.

    You can tell which PDF files have already been OCRed by testing them with pdffonts. If a file contains embedded fonts, it has very probably been OCRed already.

    As for the batch processing, I wrote a little script that can batch OCR files to PDF, Word, Excel, or CSV output.

    You may find it at https://github.com/deajan/pmOCR. pmOCR (poor man's OCR) is a wrapper for the ABBYY OCR CLI for Linux or the open source Tesseract 3.
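
The pdffonts check described in the answer can be sketched in Python as follows. This is a minimal sketch, not the answerer's script. It assumes the `pdffonts` and `pdfinfo` command-line tools (from poppler-utils or Xpdf) are on the PATH, and that a page listing no fonts has no text layer. The function names and the `./pdfs` directory are hypothetical. Checking page by page is what catches files where only some pages were OCRed.

```python
import subprocess
from pathlib import Path


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows in pdffonts output (the lines after the dashed rule)."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("----"):
            return sum(1 for row in lines[i + 1:] if row.strip())
    return 0  # no table at all, so no fonts were listed


def page_count(pdf: Path) -> int:
    """Read the page count from the 'Pages:' line printed by pdfinfo."""
    out = subprocess.run(
        ["pdfinfo", str(pdf)], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("Pages:"):
            return int(line.split()[1])
    raise ValueError(f"no page count reported for {pdf}")


def pages_without_fonts(pdf: Path) -> list[int]:
    """Return the 1-based page numbers on which pdffonts lists no fonts."""
    missing = []
    for page in range(1, page_count(pdf) + 1):
        # -f and -l restrict pdffonts to a first and last page.
        out = subprocess.run(
            ["pdffonts", "-f", str(page), "-l", str(page), str(pdf)],
            capture_output=True, text=True, check=True,
        ).stdout
        if count_fonts(out) == 0:
            missing.append(page)
    return missing


if __name__ == "__main__":
    # Hypothetical input directory; adjust to where the PDFs live.
    for pdf in sorted(Path("./pdfs").rglob("*.pdf")):
        todo = pages_without_fonts(pdf)
        if todo:
            print(f"{pdf}: pages still needing OCR: {todo}")
```

The files it prints could then be handed to whichever OCR tool is in use. Note that running pdffonts once per page is slow across 10,000 files, and that the font test is a heuristic: a page with fonts is only probably OCRed.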
