How to preserve document structure in tesseract


南方客 2020-12-11 00:40

I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me, but Tesseract currently does not preserve it.

5 answers

轮回少年 (original poster)

2020-12-11 01:27

Newer versions of tesseract (3.04 and later) have an option called preserve_interword_spaces which should do what you want.

Note that the number of spaces tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output this way. The preserve_interword_spaces option does not attempt to do anything fancy; it merely preserves the spaces that extraction found. By default tesseract collapses runs of spaces into one.
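As a rough illustration of turning the option on, here is a minimal Python sketch that calls the tesseract command-line tool with `-c preserve_interword_spaces=1`. It assumes tesseract 3.04 or later is installed and on your PATH. The helper names `build_tesseract_command` and `ocr_preserving_spaces` are made up for this example and are not part of tesseract.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout"):
    """Build the argument list for a tesseract run that keeps runs of spaces.

    Hypothetical helper for illustration. Passing "stdout" as the output
    base makes tesseract print the recognized text instead of writing a file.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        # -c sets a single config variable for this run only.
        "-c",
        "preserve_interword_spaces=1",
    ]


def ocr_preserving_spaces(image_path):
    """Run tesseract on image_path and return the recognized text.

    Assumes the tesseract binary is on PATH; raises CalledProcessError
    if tesseract exits with a non-zero status.
    """
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_preserving_spaces("page.png")` (with `page.png` standing in for your own image) should return the text with the detected runs of spaces left in place. As noted above, the widths may still differ from line to line, so column alignment is not guaranteed.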

Details on this option are here.
