Tesseract: Differing output / How do I find out which parameters are being used on a given run?

妖精的绣舞 submitted on 2020-03-04 23:04:57

Question


Consider this small png image depicting the word 'Account' in black on a white background:

For this ground-truth image, the output differs between the following two Tesseract command-line invocations, with (A) better than (B). I need (B) in order to have any hope of controlling Tesseract's roughly 660 configuration parameters, but with (A)'s extraction accuracy.

Case A (no config file):

tesseract -v test.png test

tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE
Tesseract Open Source OCR Engine v4.1.0 with Leptonica

cat test.txt

Account
^L

Case B (using config file, which is obviously desirable):

tesseract --print-parameters > tess_default.cfg
tesseract -v test.png test test_default.cfg
ccot
^L      Page separator (default is form feed control character)

I believe the output should be the same in both cases, but it is not. Why? Case A's output is clearly more accurate than Case B's.

How does one otherwise discover the current configuration of Tesseract if not using --print-parameters?

Please consider only the Tesseract command line under *nix; no Python or Java SDKs, etc., on this occasion.

Thanks!

  • Tesseract Version: 4.1.0
  • Commit Number: [executed: brew install tesseract]
  • Platform: macOS High Sierra 10.13.6 / Darwin redacted.office 17.7.0 Darwin Kernel Version 17.7.0: Sun Jun 2 20:31:42 PDT 2019; root:xnu-4570.71.46~1/RELEASE_X86_64 x86_64
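
One thing worth checking for Case B, offered as an assumption rather than a confirmed diagnosis: the --print-parameters dump appears to consist of a header line followed by tab-separated name, value, and description columns, whereas a config file is read as a parameter name followed by its value. Feeding the dump back unchanged could therefore fold the description text into the value, which would be consistent with the page separator line shown in the Case B output. The sketch below uses a hypothetical helper, params_to_config, that keeps only the name and value columns and skips the header and any parameter whose value is empty. The tab-separated layout is assumed, and even a cleaned file may not reproduce Case A exactly, since some parameters may not be settable from a config file.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): reduce a parameter dump that is
# ASSUMED to be tab-separated (name<TAB>value<TAB>description) to
# "name value" lines suitable for a config file.
params_to_config() {
    # Lines without a tab (such as a header) have NF == 1 and are skipped.
    # Parameters with an empty value are skipped so their defaults stay in effect.
    awk -F '\t' 'NF >= 2 && $2 != "" { print $1, $2 }'
}

# Example usage, commented out so the snippet does not require Tesseract:
#   tesseract --print-parameters | params_to_config > tess_default.cfg
#   tesseract test.png test tess_default.cfg
```

Comparing the cleaned file against the raw dump (for example with diff) is one way to see which lines carried description text into the config.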

Source: https://stackoverflow.com/questions/57794165/tesseract-differing-output-how-do-i-find-out-which-parameters-are-being-used
