Disable dictionary-assisted OCR in tesseract C++ API

Submitted by 给你一囗甜甜゛ on 2021-01-27 15:59:22

Question


I have an application where technical datasheets are OCR'd using the tesseract API. I initialize it like this:

tesseract::TessBaseAPI tess;
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);

However, even after setting a custom whitelist like this:

tess.SetVariable("tessedit_char_blacklist", "");
tess.SetVariable("tessedit_char_whitelist", myWhitelist);

some datasheet entries are still misrecognized; for example, PA3 is recognized as FAB.

How can I disable the dictionary-assisted OCR? To avoid affecting other tools, I would prefer not to modify global config files if possible.

Note: This is not a duplicate of this previous question because said question explicitly asks for the command-line tool while I explicitly ask for the tesseract API.


Answer 1:


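You can do it the following way:

#include <cstdio>
#include <cstdlib>
#include <tesseract/baseapi.h>

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
// Init returns 0 on success and non-zero on failure.
if (api->Init(NULL, "eng"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}

// SetVariable returns false if the parameter could not be set.
if (!api->SetVariable("tessedit_enable_doc_dict", "0"))
{
    fprintf(stderr, "Unable to disable the document dictionary.\n");
}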

Simply pass "tessedit_enable_doc_dict" to the SetVariable function together with its boolean value ("0" to disable).

I found it in the tesseractclass.h header file (line 839): https://tesseract-ocr.github.io/a00736_source.html. I think the best way to find the correct parameters is to look at the ones defined in the header file for your version (mine is 3.04). I tried a few that I found on the internet before, but they didn't work. This was the configuration that worked for me.




Answer 2:


You can simply set the penalties to zero:

tess.SetVariable("segment_penalty_garbage", "0");
tess.SetVariable("segment_penalty_dict_nonword", "0");
tess.SetVariable("segment_penalty_dict_frequent_word", "0");
tess.SetVariable("segment_penalty_dict_case_ok", "0");
tess.SetVariable("segment_penalty_dict_case_bad", "0");

Although the dictionary stays active, this approach tells the algorithm that a dictionary hit (this also covers bad punctuation etc.) is no better than a non-dictionary hit.

See the dict.cpp source code for reference.
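The five calls above can also be folded into a small helper. The sketch below is one way to do that and is not part of the original answer: `zeroDictionaryPenalties` and `kDictPenalties` are hypothetical names, and the function is templated on the API type so that it compiles without Tesseract installed. It only assumes the object has a `SetVariable(const char*, const char*)` member returning `bool`, as `tesseract::TessBaseAPI` does.

```cpp
#include <array>
#include <cstdio>

// Penalty parameters named in the answer above.
constexpr std::array<const char*, 5> kDictPenalties = {
    "segment_penalty_garbage",
    "segment_penalty_dict_nonword",
    "segment_penalty_dict_frequent_word",
    "segment_penalty_dict_case_ok",
    "segment_penalty_dict_case_bad",
};

// Hypothetical helper: sets every penalty to "0" and returns how many
// SetVariable calls were rejected (0 means all were accepted).
// Api is any type with bool SetVariable(const char*, const char*),
// for example tesseract::TessBaseAPI.
template <class Api>
int zeroDictionaryPenalties(Api& api) {
    int rejected = 0;
    for (const char* name : kDictPenalties) {
        if (!api.SetVariable(name, "0")) {
            std::fprintf(stderr, "Could not set %s\n", name);
            ++rejected;
        }
    }
    return rejected;
}
```

With a real engine this would be called as `zeroDictionaryPenalties(tess);` after `tess.Init(...)`. A non-zero return value suggests that at least one parameter name is not known to the installed version.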




Answer 3:


You can turn off dictionaries only during initialization of the API. See "tesseract-ocr API example in C++ of changing init parameters" for tesseract 3.02.
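For reference, here is a minimal sketch of what init-time configuration can look like. It is not taken from the linked example. It assumes that `load_system_dawg` and `load_freq_dawg` are the init parameters to switch off, and that Tesseract 5.x is in use, where the long overload of `Init` accepts `std::vector<std::string>` lists (3.x and 4.x take `GenericVector<STRING>` instead). `InitVars` and `dictionaryOffVars` are hypothetical names; the block only builds the two parallel lists, so it compiles without Tesseract.

```cpp
#include <string>
#include <vector>

// Parallel lists of parameter names and values, in the shape that the
// long overload of TessBaseAPI::Init expects (hypothetical wrapper type).
struct InitVars {
    std::vector<std::string> names;
    std::vector<std::string> values;
};

// Builds the init-time settings that skip loading the word lists.
// Assumption: these two parameters cover the dictionaries you care about.
InitVars dictionaryOffVars() {
    InitVars vars;
    vars.names = {"load_system_dawg", "load_freq_dawg"};
    vars.values = {"0", "0"};  // "0" is parsed as false for bool parameters
    return vars;
}

// Intended use with Tesseract 5.x (not compiled here):
//   InitVars vars = dictionaryOffVars();
//   tesseract::TessBaseAPI tess;
//   tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY,
//             NULL, 0, &vars.names, &vars.values, false);
```

Because the settings travel with the `Init` call, no global config file has to be touched, which matches the constraint stated in the question.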



Source: https://stackoverflow.com/questions/33005215/disable-dictionary-assisted-ocr-in-tesseract-c-api
