Get font of recognized character with Tesseract-OCR

Question

Is it possible to get the font of the recognized characters with Tesseract-OCR (for example, whether they are Arial or Times New Roman), either from the command line or through the API?

I'm scanning documents that might have different parts with different fonts, and it would be useful to have this information.

Answer 1:

Tesseract's API provides a WordFontAttributes function, available through the ResultIterator class, that you can use.

Answer 2:

Based on nguyenq's answer, I wrote a simple Python script that prints the font name for each detected character. The script uses the Python library tesserocr.

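from tesserocr import PyTessBaseAPI, RIL, iterate_level

def get_font(image_path):
    """Print the font name reported for each recognized symbol."""
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()
        level = RIL.SYMBOL

        # Walk the result symbol by symbol; font attributes are per word.
        for r in iterate_level(ri, level):
            symbol = r.GetUTF8Text(level)
            word_attributes = r.WordFontAttributes()

            # word_attributes may be None if no font information is available.
            if symbol and word_attributes:
                print('symbol {}, font: {}'.format(symbol, word_attributes['font_name']))

get_font('logo.jpg')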
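
Since the question is about documents whose parts use different fonts, it may help to collapse the per-symbol output into runs of consecutive symbols that share a font. The following is only a minimal sketch: group_by_font is a hypothetical helper (not part of tesserocr), it assumes you have collected (symbol, font_name) pairs from a loop like the one above, and the font names in the sample data are made up for illustration.

```python
from itertools import groupby


def group_by_font(symbol_fonts):
    """Collapse (symbol, font_name) pairs into (font_name, text) runs.

    Hypothetical helper: consecutive symbols with the same font name
    are joined into one string, preserving their original order.
    """
    runs = []
    for font_name, group in groupby(symbol_fonts, key=lambda pair: pair[1]):
        text = ''.join(symbol for symbol, _ in group)
        runs.append((font_name, text))
    return runs


# Illustrative data only; real pairs would come from the OCR loop.
sample = [('H', 'Arial'), ('i', 'Arial'), ('!', 'Times_New_Roman')]
for font_name, text in group_by_font(sample):
    print('{}: {}'.format(font_name, text))
```

Grouping only adjacent symbols (rather than all symbols of a font) keeps the reading order, so a document that switches fonts back and forth yields one run per switch.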

Source: https://stackoverflow.com/questions/15679017/get-font-of-recognized-character-with-tesseract-ocr

Tags

tesseract
