Question
Is it possible to get the font of the characters recognized by Tesseract-OCR (for example, whether they are Arial or Times New Roman), either from the command line or through the API?
I'm scanning documents that might have different parts with different fonts, and it would be useful to have this information.
Answer 1:
Tesseract has an API function, WordFontAttributes, defined in the ResultIterator class, that you can use.
Answer 2:
Based on nguyenq's answer, I wrote a simple Python script that prints the font name for each detected character. The script uses the Python library tesserocr.
from tesserocr import PyTessBaseAPI, RIL, iterate_level

def get_font(image_path):
    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()
        level = RIL.SYMBOL
        for r in iterate_level(ri, level):
            symbol = r.GetUTF8Text(level)
            # Font attributes are reported per word, so all symbols in a word share them.
            word_attributes = r.WordFontAttributes()
            # Skip symbols with no text or with no font information available.
            if symbol and word_attributes:
                print('symbol {}, font: {}'.format(symbol, word_attributes['font_name']))

get_font('logo.jpg')
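Since WordFontAttributes describes a whole word, it may be more convenient to iterate at word level and tally the fonts, which fits the case of a document whose parts use different fonts. The following is only a sketch under assumptions: the helper names iter_word_fonts and count_fonts are made up for illustration, and only the 'font_name' key already used above is relied on. As far as I know, font attributes come from the legacy recognition engine, so with Tesseract 4 or later they may be missing unless the legacy engine and its trained data are used.

```python
from collections import Counter


def iter_word_fonts(image_path):
    """Yield (word, font attributes) pairs for each recognized word.

    Hypothetical helper; tesserocr is imported lazily so the tallying
    function below can be used without it.
    """
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()
        for r in iterate_level(ri, RIL.WORD):
            word = r.GetUTF8Text(RIL.WORD)
            attrs = r.WordFontAttributes()
            # Skip empty words and words without font information.
            if word and attrs:
                yield word, attrs


def count_fonts(word_fonts):
    """Count how many words were assigned to each font name."""
    return Counter(attrs['font_name'] for _, attrs in word_fonts)


if __name__ == '__main__':
    # 'logo.jpg' is the sample file name from the script above.
    for font, n in count_fonts(iter_word_fonts('logo.jpg')).most_common():
        print('{}: {} words'.format(font, n))
```

Keeping the tally separate from the OCR loop means the same count_fonts function can be applied to one page, one region, or the whole document.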
Source: https://stackoverflow.com/questions/15679017/get-font-of-recognized-character-with-tesseract-ocr