Tesseract OCR Read Horizontally rather than Vertically C#

Submitted by 只谈情不闲聊 on 2020-06-13 08:57:44

Question


We have a C# .NET app that uses Tesseract to do optical character recognition (OCR) on .tiff files. Here's an example:

We then output the data to a text file. However, Tesseract is reading the data vertically. In my example image, it reads the tiff as two columns of data, and the output from Tesseract looks like this:

TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes

What we want is for Tesseract to read the tiff file horizontally, so that the output looks like this:

TYPE:12345
DATE:2017-04-06
Address:100 Main St.
City:Some City
State:Some State
Owner:John Doe
Owner Type:Primary
Acreage:10.25
Mortgage:Yes

We've tried the various page segmentation options for Tesseract, but they all produce the same result.

Has anyone run into this same issue? Anybody have any ideas?


Answer 1:


I found a solution. Tesseract has a set of config files, and several of them contain the setting tessedit_pageseg_mode. It was set to 1 in all of those config files. 1 = automatic page segmentation with OSD, where OSD = orientation and script detection.

Bottom line: these config file settings were overriding our command-line argument. Once I removed the tessedit_pageseg_mode parameter from the config files, our command-line argument of -psm 6 worked and produced the output data in the desired format.

psm = page segmentation mode. 6 = assume a single uniform block of text.

-psm 4 also worked.

psm = page segmentation mode. 4 = assume a single column of text of variable sizes.
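For readers who call the Tesseract executable from C#, here is a minimal sketch of passing that argument. It assumes tesseract is on the PATH and that the config files no longer set tessedit_pageseg_mode. The file names input.tiff and output are hypothetical placeholders, not taken from the original app. Note that -psm is the Tesseract 3.x spelling; Tesseract 4 and later spell the option --psm.

```csharp
using System;
using System.Diagnostics;

class TesseractCli
{
    static void Main()
    {
        // Hypothetical file names. "output" is a base name; Tesseract appends ".txt".
        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract",                  // assumes tesseract is on the PATH
            Arguments = "input.tiff output -psm 6",  // 6 = single uniform block of text
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Wait for OCR to finish before reading output.txt.
            process.WaitForExit();
            Console.WriteLine("tesseract exited with code " + process.ExitCode);
        }
    }
}
```

Swapping 6 for 4 in the Arguments string gives the single-column variant mentioned above.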




Answer 2:


I know this is an old post, but I ran into the same problem today.

Setting the segmentation mode with engine.SetVariable("tessedit_pageseg_mode", 6); did not work.

And for some reason I didn't find the setting in the config files.

Solution:

engine.DefaultPageSegMode = PageSegMode.SingleBlock;
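In context, that line might sit in a program like the sketch below. It assumes the Tesseract .NET wrapper (the Tesseract NuGet package) and English trained data in a ./tessdata folder. The paths and file names are hypothetical placeholders.

```csharp
using System;
using System.IO;
using Tesseract;

class OcrExample
{
    static void Main()
    {
        // Hypothetical paths: adjust to where the trained data and the image live.
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        {
            // Treat the page as a single uniform block of text (same as psm 6),
            // so each label is read together with the value on the same row.
            engine.DefaultPageSegMode = PageSegMode.SingleBlock;

            using (var img = Pix.LoadFromFile(@"./input.tiff"))
            using (var page = engine.Process(img))
            {
                // Write the recognized text to a file, as in the question.
                File.WriteAllText(@"./output.txt", page.GetText());
            }
        }
    }
}
```

The wrapper's Process method also accepts a PageSegMode argument, so engine.Process(img, PageSegMode.SingleBlock) is an alternative if the mode should apply to one image only.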


Source: https://stackoverflow.com/questions/43259694/tesseract-ocr-read-horizontally-rather-than-vertically-c-sharp
