Tesseract - ambiguity in space and tab

Anonymous (unverified), submitted 2019-12-03 02:38:01

Question:

I have a TIFF file that contains some text separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. A sample example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images of the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?

Answer 1:

Tesseract compresses consecutive spaces into one. You would need to modify baseapi.cpp to preserve the spaces. The code change can be found in the following posts:

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J



Answer 2:

After a lot of research I found the solution. Here are the steps to follow:

  1. Upgrade your Tesseract to 3.04 (or later)

  2. Create config.txt (create the file in the directory where your input image file is)

  3. In the config file, define "preserve_interword_spaces"

  4. After the word preserve_interword_spaces give either 0 or 1 (1 keeps the spaces between words, 0 collapses them). Ex:

preserve_interword_spaces 0

or

preserve_interword_spaces 1

  5. Test & Cheers!!!
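
For anyone who wants to script the steps above, here is a minimal Python sketch. It assumes the tesseract binary (3.04 or later) is on your PATH and that the command line takes the form `tesseract imagename outputbase [configfile...]`. The function names (`write_config`, `build_command`, `run_ocr`) and the file name `sample.tif` are made up for illustration and are not part of Tesseract.

```python
import subprocess
from pathlib import Path


def write_config(directory: Path, preserve: bool = True) -> Path:
    """Write config.txt next to the image, setting preserve_interword_spaces."""
    config_path = directory / "config.txt"
    # One "name value" pair per line; 1 keeps inter-word spaces, 0 collapses them.
    config_path.write_text(f"preserve_interword_spaces {int(preserve)}\n")
    return config_path


def build_command(image: Path, output_base: Path, config: Path) -> list:
    """Build the CLI call: tesseract <image> <outputbase> <configfile>."""
    return ["tesseract", str(image), str(output_base), str(config)]


def run_ocr(image: Path) -> str:
    """Run Tesseract on the image and return the recognized text.

    Hypothetical helper: needs a working tesseract install to run.
    """
    config = write_config(image.parent)
    output_base = image.parent / image.stem
    subprocess.run(build_command(image, output_base, config), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt").read_text()
```

Calling `run_ocr(Path("sample.tif"))` would write config.txt beside the image, run Tesseract, and return the text. Whether the column gaps come back as exactly four spaces depends on the image, so check the output against your own files.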

