Question
We have a C# .NET app that uses Tesseract to do Optical Character Recognition (OCR) on .tiff files. Here's an example:
We then write the data to a text file. However, Tesseract is reading the data vertically. In my example image, it reads the TIFF as two columns of data, and the output from Tesseract looks like this:
TYPE:
DATE:
Address:
City:
State:
Owner:
Owner Type:
Acreage:
Mortgage:
12345
2017-04-06
100 Main St.
Some City
Some State
John Doe
Primary
10.25
Yes
What we want is for Tesseract to read the TIFF file horizontally, so that the output looks like this:
TYPE:12345
DATE:2017-04-06
Address:100 Main St.
City:Some City
State:Some State
Owner:John Doe
Owner Type:Primary
Acreage:10.25
Mortgage:Yes
We've tried the various page segmentation options for Tesseract, but they all produce the same result.
Has anyone run into this same issue? Anybody have any ideas?
Answer 1:
I found a solution. Tesseract has a set of config files, and several of them contain the setting tessedit_pageseg_mode. It was set to 1 in all of those files (1 = automatic page segmentation with OSD, where OSD = orientation and script detection).
Bottom line: these config file settings were overriding our command line argument. Once I removed the tessedit_pageseg_mode parameter from the config files, our command line argument of -psm 6 worked and produced the output data in the desired format (psm = page segmentation mode; 6 = assume a single uniform block of text).
-psm 4 also worked (4 = assume a single column of text of variable sizes).
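For reference, here is a minimal sketch of how a C# app might pass that argument when it launches the Tesseract executable. It assumes the app calls the command line tool through System.Diagnostics.Process, which the question implies but does not show. The class name, method names, and paths are hypothetical. The sketch passes no config file names, so nothing on the command line can reintroduce tessedit_pageseg_mode. Note that -psm is the Tesseract 3.x spelling; Tesseract 4 and later use --psm.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Tesseract or any wrapper library.
public static class TesseractCli
{
    // Builds the argument string for the Tesseract 3.x command line:
    //   tesseract <image> <outputbase> -psm <mode>
    // Tesseract 4 and later spell the flag --psm instead of -psm.
    public static string BuildArguments(string imagePath, string outputBase, int psm)
    {
        return $"\"{imagePath}\" \"{outputBase}\" -psm {psm}";
    }

    // Runs the executable and returns its exit code.
    // Tesseract writes the recognized text to <outputBase>.txt.
    public static int Run(string exePath, string imagePath, string outputBase, int psm)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(imagePath, outputBase, psm),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call such as TesseractCli.Run("tesseract", "scan.tif", "scan", 6) would then write scan.txt, assuming tesseract is on the PATH and scan.tif exists.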
Answer 2:
I know this is an old post, but I ran into the same problem today.
Setting the segmentation mode with engine.SetVariable("tessedit_pageseg_mode", 6);
did not work.
And for some reason I didn't find the setting in the config files.
Solution:
engine.DefaultPageSegMode = PageSegMode.SingleBlock;
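In context, that line might be used as in the sketch below. It assumes the Tesseract NuGet package (the charlesw .NET wrapper), which the post does not name explicitly, plus an English traineddata file in the tessdata folder. The class name, method name, and paths are hypothetical. PageSegMode.SingleBlock corresponds to page segmentation mode 6.

```csharp
using System;
using Tesseract; // assumed: the "Tesseract" NuGet package (charlesw .NET wrapper)

// Hypothetical helper for illustration only.
public static class SingleBlockOcr
{
    // Runs OCR on one image, treating the page as a single uniform block of text.
    public static string ReadText(string tessdataPath, string imagePath)
    {
        using (var engine = new TesseractEngine(tessdataPath, "eng", EngineMode.Default))
        {
            // Applies to every later Process call that does not pass its own mode.
            engine.DefaultPageSegMode = PageSegMode.SingleBlock;

            using (var image = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(image))
            {
                return page.GetText();
            }
        }
    }
}
```

For example, SingleBlockOcr.ReadText("./tessdata", "scan.tif") would return the recognized text as a string. The mode can also be passed per call, as in engine.Process(image, PageSegMode.SingleBlock), if other images need a different mode.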
Source: https://stackoverflow.com/questions/43259694/tesseract-ocr-read-horizontally-rather-than-vertically-c-sharp