How to ignore special characters in Tesseract OCR using Java

Submitted by 末鹿安然 on 2019-12-26 06:33:44

Question


I have extracted text from an image with Tesseract OCR in Java, but the output contains some special characters because the image contains symbols.

I want to ignore all special characters and display only the text. Is there any way I can do that?


Answer 1:


In Tesseract you can set TessBaseAPI.VAR_CHAR_WHITELIST and TessBaseAPI.VAR_CHAR_BLACKLIST to make it ignore specific characters.

The following makes Tesseract recognize only A-Z and digits:

// Restrict recognition to uppercase letters and digits.
String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
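
Typing a long whitelist by hand is error-prone, so it can help to build the string from character ranges. The following is a minimal sketch in plain Java; the class name CharWhitelist and its methods are hypothetical helpers introduced for illustration, not part of any Tesseract binding.

```java
/** Hypothetical helper that assembles a whitelist string from character ranges. */
final class CharWhitelist {
    private CharWhitelist() {}

    /** Appends every character from first to last (both inclusive) to sb. */
    static void appendRange(StringBuilder sb, char first, char last) {
        // Loop over int so that a range ending at Character.MAX_VALUE cannot overflow.
        for (int c = first; c <= last; c++) {
            sb.append((char) c);
        }
    }

    /** Returns the uppercase letters A-Z followed by the digits 0-9. */
    static String upperCaseAndDigits() {
        StringBuilder sb = new StringBuilder();
        appendRange(sb, 'A', 'Z');
        appendRange(sb, '0', '9');
        return sb.toString();
    }
}
```

The result of CharWhitelist.upperCaseAndDigits() is the same string as the literal above and can be passed to setVariable in its place.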

The next snippet lets Tesseract recognize everything except the blacklisted characters, here ~ and fl:

// Characters listed here are excluded from recognition.
String blackList = "~fl";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, blackList);

Also note that, as mentioned in a Tesseract GitHub issue, you can't blacklist or whitelist characters with Tesseract 4.0 Alpha LSTM; instead, you should train the LSTM model on the characters you expect in your images.

Of course, if you want, you can still use the 3.* versions of Tesseract; its tessdata is located here
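
A different technique from the whitelist and blacklist variables is to filter the recognized string after OCR has run, which does not depend on the Tesseract version or engine. The sketch below uses only java.util.regex; the class name OcrTextCleaner is hypothetical, and it assumes that "special character" means anything that is not a Unicode letter, a digit, or whitespace. Adjust the pattern if punctuation should be kept.

```java
import java.util.regex.Pattern;

/** Hypothetical post-processing helper for text returned by an OCR engine. */
final class OcrTextCleaner {
    // Matches any character that is not a Unicode letter, a Unicode number, or whitespace.
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}\\s]");

    private OcrTextCleaner() {}

    /** Returns ocrText with every character matched by SPECIAL removed. */
    static String stripSpecialCharacters(String ocrText) {
        return SPECIAL.matcher(ocrText).replaceAll("");
    }
}
```

For example, OcrTextCleaner.stripSpecialCharacters("Total: $42.00!") returns "Total 4200". Unlike a whitelist, this does not influence recognition itself: a symbol misread as a letter is still kept, so the filter only cleans up the output.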



Source: https://stackoverflow.com/questions/48702490/how-to-ignore-special-characters-in-tesseract-ocr-using-java
