Question
I have an application where technical datasheets are OCR'd using the tesseract API. I initialize it like this:
tesseract::TessBaseAPI tess;
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);
However, even after using custom whitelists like this
tess.SetVariable("tessedit_char_blacklist", "");
tess.SetVariable("tessedit_char_whitelist", myWhitelist);
some datasheet entries are recognized incorrectly; for example, PA3 is recognized as FAB.
How can I disable the dictionary-assisted OCR? To avoid affecting other tools, I don't want to modify global config files if possible.
Note: This is not a duplicate of this previous question because said question explicitly asks for the command-line tool while I explicitly ask for the tesseract API.
Answer 1:
You can do it in the following way:
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
// Init returns non-zero on failure.
if (api->Init(NULL, "eng"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
// SetVariable returns false if the parameter name is not recognized.
if (!api->SetVariable("tessedit_enable_doc_dict", "0"))
{
    fprintf(stderr, "Unable to disable the document dictionary.\n");
}
Simply pass "tessedit_enable_doc_dict" as the parameter name to the SetVariable function, along with its corresponding boolean value.
I found it in the tesseractclass.h header file (line 839): https://tesseract-ocr.github.io/a00736_source.html
I think the best way to find the correct parameters is to look at the values defined in the header file that corresponds to your version (mine is 3.04).
I tried a few parameters I found on the internet first, but they didn't work. This was the configuration that worked for me.
Answer 2:
You can simply set the penalties to zero:
tess.SetVariable("segment_penalty_garbage", "0");
tess.SetVariable("segment_penalty_dict_nonword", "0");
tess.SetVariable("segment_penalty_dict_frequent_word", "0");
tess.SetVariable("segment_penalty_dict_case_ok", "0");
tess.SetVariable("segment_penalty_dict_case_bad", "0");
Although the dictionary stays active, this approach tells the algorithm that a dictionary hit (which also covers bad punctuation, etc.) is no better than a non-dictionary hit.
See the dict.cpp source code for reference.
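The five calls above can be wrapped in a loop that also checks each return value, since SetVariable returns false when it does not accept a parameter. This is only a minimal sketch: the helper name zeroDictionaryPenalties is hypothetical, and it is written as a template so it works with any object that has a SetVariable(const char*, const char*) member returning bool, such as tesseract::TessBaseAPI.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Tesseract): sets every penalty listed in
// this answer to "0" and returns the names of the parameters that were
// rejected, i.e. for which SetVariable returned false.
template <typename Api>
std::vector<std::string> zeroDictionaryPenalties(Api& api) {
    static const char* const kPenalties[] = {
        "segment_penalty_garbage",
        "segment_penalty_dict_nonword",
        "segment_penalty_dict_frequent_word",
        "segment_penalty_dict_case_ok",
        "segment_penalty_dict_case_bad",
    };
    std::vector<std::string> rejected;
    for (const char* name : kPenalties) {
        if (!api.SetVariable(name, "0")) {
            rejected.push_back(name);
        }
    }
    return rejected;
}
```

With the tess object from the question, this would be called as zeroDictionaryPenalties(tess) after Init; an empty result means all five parameters were accepted by your Tesseract version.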
Answer 3:
You can turn off dictionaries only during initialization of the API. See the tesseract-ocr API example in C++ for changing init parameters in tesseract 3.02.
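A rough sketch of what passing init-time parameters could look like follows. It rests on several assumptions that are not stated in this answer: that the relevant parameters are load_system_dawg and load_freq_dawg, and that the eight-argument Init overload takes std::vector<std::string> pointers, as it does in Tesseract 5.x (3.x and 4.x releases take GenericVector<STRING> instead, so the vector types would need to change there). The helper name initWithoutDictionaries is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of Tesseract). Calls the long Init overload
// with two init-time parameters set to false so the word lists are not loaded.
// Assumption: Api::Init has the Tesseract 5.x shape
//   Init(datapath, language, mode, configs, configs_size,
//        vars_vec, vars_values, set_only_non_debug_params)
template <typename Api, typename EngineMode>
int initWithoutDictionaries(Api& api, const char* language, EngineMode mode) {
    // Assumed parameter names; check them against your version's headers.
    const std::vector<std::string> names = {"load_system_dawg", "load_freq_dawg"};
    const std::vector<std::string> values = {"F", "F"};
    // The two vectors are parallel: values[i] belongs to names[i].
    assert(names.size() == values.size());
    // nullptr datapath and no config files, as in the question's Init call.
    return api.Init(nullptr, language, mode, nullptr, 0, &names, &values, false);
}
```

With the setup from the question, the call would be initWithoutDictionaries(tess, "eng", tesseract::OEM_TESSERACT_ONLY); as with the plain Init call, a non-zero return value indicates failure.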
Source: https://stackoverflow.com/questions/33005215/disable-dictionary-assisted-ocr-in-tesseract-c-api