How do I segment a document using Tesseract, then output the resulting bounding boxes and labels?

忘了有多久 2020-12-07 10:25

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre-OCR). I know it must be capable of doing this 'out of the

6 answers
  •  轮回少年
    2020-12-07 11:00

    With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr creates a basefilename.hocr file containing block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Running the same command without the hocr config produces a plain text file with newlines between block-level text, but the hOCR format is more explicit.

    More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
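
Once the .hocr file exists, the boxes and their labels can be pulled out with a few lines of Python. The following is a minimal sketch, not part of the original answer. It assumes the usual hOCR layout in which each element carries a class such as ocr_carea, ocr_par, ocr_line or ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The names LEVELS, HocrBoxes and extract_boxes are hypothetical helpers chosen for illustration, and other classes that some Tesseract versions may emit are simply skipped.

```python
import re
from html.parser import HTMLParser

# Assumed mapping from hOCR class names to readable labels (illustrative only).
LEVELS = {
    "ocr_carea": "block",
    "ocr_par": "paragraph",
    "ocr_line": "line",
    "ocrx_word": "word",
}

# The hOCR title attribute stores the box as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collects (label, (x0, y0, x1, y1)) for every recognized layout element."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = LEVELS.get(attrs.get("class"))
        match = BBOX_RE.search(attrs.get("title") or "")
        # Keep only elements that have both a known class and a bbox.
        if label and match:
            coords = tuple(int(v) for v in match.groups())
            self.boxes.append((label, coords))


def extract_boxes(hocr_text):
    """Return the labelled bounding boxes found in an hOCR document string."""
    parser = HocrBoxes()
    parser.feed(hocr_text)
    return parser.boxes
```

To use it on the file from the command above, read target/directory/basefilename.hocr as UTF-8 text and pass the string to extract_boxes; each returned tuple pairs a level label with pixel coordinates in document order.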
