PDF text extraction from given coordinates

情话喂你 2020-11-27 10:43

I would like to extract text from a portion of a PDF page (specified by coordinates) using Ghostscript.

Can anyone help me out?

3 answers
  •  执笔经年
    2020-11-27 11:20

    I'm not sure Ghostscript can accept coordinates, but you can convert the PDF page to an image and send it to an OCR engine, either as a subimage cropped to the given coordinates or as the whole image along with the coordinates. Some OCR APIs accept a rectangle parameter that narrows the region to recognize.

    Look at VietOCR for a working example; it uses Tesseract as its OCR engine and Ghostscript as its PDF-to-image converter.
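
A minimal Python sketch of the render-then-crop approach described above, not taken from VietOCR. It assumes the rectangle is given in PDF points with the origin at the bottom-left of an unrotated page whose media box starts at (0, 0), that the Ghostscript executable is named `gs` (on Windows it is typically `gswin64c`), and that the Pillow and pytesseract packages are installed. The function names and the `page_height_pt` parameter are hypothetical, chosen only for illustration.

```python
import os
import subprocess
import tempfile

POINTS_PER_INCH = 72  # default PDF user-space unit


def render_command(pdf_path, png_path, page=1, dpi=300, gs="gs"):
    """Build a Ghostscript command that renders a single page to a PNG."""
    return [
        gs, "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


def pdf_rect_to_pixel_box(rect, page_height_pt, dpi=300):
    """Map (x0, y0, x1, y1) in PDF points (origin bottom-left) to a
    (left, upper, right, lower) pixel box (origin top-left)."""
    x0, y0, x1, y1 = rect
    left = round(min(x0, x1) * dpi / POINTS_PER_INCH)
    right = round(max(x0, x1) * dpi / POINTS_PER_INCH)
    # The image y axis points down, so flip against the page height.
    upper = round((page_height_pt - max(y0, y1)) * dpi / POINTS_PER_INCH)
    lower = round((page_height_pt - min(y0, y1)) * dpi / POINTS_PER_INCH)
    return (left, upper, right, lower)


def extract_region_text(pdf_path, rect, page_height_pt, page=1, dpi=300):
    """Render one page, crop it to the rectangle, and OCR the crop."""
    # Imported here so the helpers above work without these packages.
    from PIL import Image
    import pytesseract

    box = pdf_rect_to_pixel_box(rect, page_height_pt, dpi)
    with tempfile.TemporaryDirectory() as tmp:
        png_path = os.path.join(tmp, "page.png")
        subprocess.run(render_command(pdf_path, png_path, page, dpi), check=True)
        with Image.open(png_path) as img:
            return pytesseract.image_to_string(img.crop(box))
```

For a US Letter page (792 points tall), a call such as `extract_region_text("input.pdf", (72, 648, 288, 720), 792)` would OCR a 3 by 1 inch strip starting one inch from the top-left corner; `input.pdf` is a placeholder file name. Rendering at a higher `dpi` generally helps OCR accuracy at the cost of a larger image.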
