Tesseract OCR won't recognize division symbol “÷”

北战南征 submitted on 2021-01-21 07:21:08

Question


I am using Tesseract on iOS 8 for an OCR-based app, but it incorrectly converts the division symbol "÷" in the image to a plus sign "+".

For example, this image

[Image: simple arithmetic expression]

always converts to the text string "8+4+4". It should be "8+4÷4".

I've tried using different trained data language files ("eng+equ", "ita"), adding "÷" to the whitelist, setting the ocr_engine variable to cube, converting the image to grayscale or black & white, and upsizing the image by 2 and 4 times.

Everything I've tried always returns a plus "+" sign instead of a division "÷" symbol.

I tried using only the "equ" trained data file and that DOES return the division symbol correctly - but all other characters are then garbage.

I've been looking into this (Google, Stack Overflow) for several days and cannot figure it out.

How do I get Tesseract to include and recognize the division "÷" symbol?

UPDATE:

The best I have been able to do is to set the AVCaptureSession preset to high:

AVCaptureSession *session = [[AVCaptureSession alloc] init];
session.sessionPreset = AVCaptureSessionPresetHigh;

The captured image above then measures 676 × 405 pixels. I binarize it using the Tesseract OCR UIImage category (the image is named 'source'):

// Binarize the source image to improve contrast (using the UIImage category provided by TesseractOCR)
UIImage *blackAndWhiteImage = [source blackAndWhite];
[self.tesseract setImage:blackAndWhiteImage];

This will usually convert the division symbol to the text "-1-", but I've seen "-:-" and other numbers and uppercase characters between the minus signs.

I can check for that in the returned text. But then it is impossible to know whether to treat the returned text "8-1-2" as a true subtraction or 'maybe' division.
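The check described above could be sketched roughly as follows. This is only an illustration, not part of Tesseract or the TesseractOCR wrapper: the names `NormalizedExpression` and `normalizeDivision(in:)` are hypothetical, and the rule "a minus, any single character, a minus" is an assumption based on the outputs reported above. It does not resolve the ambiguity; it only flags the cases where the middle character is a digit, so the app can decide what to do with them.

```swift
import Foundation

/// Result of normalizing one OCR string (hypothetical helper type).
struct NormalizedExpression {
    let text: String        // text with candidate "-X-" patterns replaced by "÷"
    let isAmbiguous: Bool   // true when a replaced pattern could also be a real subtraction
}

/// Replaces every "-X-" run (X = any single character) with "÷".
/// When X is a digit (e.g. "-1-") the result is flagged as ambiguous,
/// because "8-1-2" is also a valid subtraction chain.
func normalizeDivision(in ocrText: String) -> NormalizedExpression {
    let chars = Array(ocrText)
    var output = ""
    var ambiguous = false
    var i = 0
    while i < chars.count {
        if i + 2 < chars.count, chars[i] == "-", chars[i + 2] == "-" {
            if chars[i + 1].isNumber { ambiguous = true }
            output.append("÷")
            i += 3  // skip the whole three-character pattern
        } else {
            output.append(chars[i])
            i += 1
        }
    }
    return NormalizedExpression(text: output, isAmbiguous: ambiguous)
}
```

With this sketch, "8+4-:-4" becomes "8+4÷4" without a flag, while "8-1-2" becomes "8÷2" with `isAmbiguous` set, which is exactly the case that cannot be decided from the text alone.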


Answer 1:


Train the OCR engine with different fonts.

Here is the tool for training the engine. Have a look at this as well.

Or you can use jTessBoxEditor.




Answer 2:


Make sure your whitelist includes the "÷" sign.

In Swift, this will do it:

tesseract.setVariableValue("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷", forKey: "tessedit_char_whitelist")

In Objective-C, here is the code:

[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷" forKey:@"tessedit_char_whitelist"];

You can customize the character set based on your needs.
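As one possible customization for an arithmetic-only app, the whitelist could be narrowed to digits and operators. The sketch below is an assumption about what such an app needs, not a recommended setting: the operator set and the helper `unexpectedCharacters(in:whitelist:)` are hypothetical. The helper only inspects text that has already been returned, so it can be used to spot results like "8-I-2" that contain characters the app never expects.

```swift
import Foundation

// Assumed character set for simple arithmetic; adjust to your own needs.
let digits = "0123456789"
let operators = "+-÷=()"
let arithmeticWhitelist = digits + operators

/// Returns the non-whitespace characters of `text` that are not in `whitelist`.
func unexpectedCharacters(in text: String, whitelist: String) -> [Character] {
    let allowed = Set(whitelist)
    return Array(text.filter { !allowed.contains($0) && !$0.isWhitespace })
}
```

The `arithmeticWhitelist` string would be passed as the value for the "tessedit_char_whitelist" key in the same way as the longer string above.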




Answer 3:


It seems that symbol was not included in the existing data. You'd need to train for that symbol, and then use the resultant traineddata in combination with existing ones.

You can use a tool, such as jTessBoxEditor, to assist you in the training process.




Answer 4:


You can also try to capture this ambiguity via the unicharambigs file. Read more at https://github.com/tesseract-ocr/tesseract/blob/master/doc/unicharambigs.5.asc

1       +       1      ÷    0

Tesseract would read it as "optionally (the trailing 0 in the above config) replace the 1 char sequence '+' with the 1 character sequence '÷'".
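To avoid whitespace mistakes when writing such entries by hand, a line like the one above could be generated with a small helper. This is a sketch under assumptions: `unicharambigsLine(source:target:mandatory:)` is a hypothetical name, and it assumes the five-field layout shown above with tab-separated fields and space-separated characters within a field. Check the linked documentation for the exact format expected by your Tesseract version.

```swift
import Foundation

/// Formats one unicharambigs entry in the five-field layout shown above:
/// source count, source characters, target count, target characters, type.
/// Assumption: fields are tab-separated; type "1" = mandatory, "0" = optional.
func unicharambigsLine(source: [String], target: [String], mandatory: Bool) -> String {
    let fields = [
        String(source.count),
        source.joined(separator: " "),
        String(target.count),
        target.joined(separator: " "),
        mandatory ? "1" : "0"
    ]
    return fields.joined(separator: "\t")
}

// Entry from this answer: optionally replace "+" with "÷".
let plusToDivision = unicharambigsLine(source: ["+"], target: ["÷"], mandatory: false)
```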




Answer 5:


In Swift, changing engineMode works for me:

let tesseract = G8Tesseract(language: "eng")!
tesseract.engineMode = .tesseractCubeCombined


Source: https://stackoverflow.com/questions/26956913/tesseract-ocr-wont-recognize-division-symbol-%c3%b7
