Tesseract OCR force pattern

流过昼夜 submitted on 2019-12-05 22:17:20

Question


I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern?

I have tried Tesseract's bazaar pattern matching with the pattern \A\A\d\d\d, but the OCR still recognizes other words that don't match.

I have tried to use the "tessedit_char_whitelist" parameter but I can't choose the position of the characters with that.
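
To illustrate the limitation, here is a minimal Python sketch. It does not call Tesseract; it only models a whitelist as a set of allowed characters, and the helper name `passes_whitelist` is hypothetical. A whitelist checks which characters appear, not where they appear, so every line made up of uppercase letters and digits is accepted.

```python
import string

# Model of a character whitelist: a set of allowed characters.
# This is an illustration only, not Tesseract's implementation.
WHITELIST = set(string.ascii_uppercase + string.digits)


def passes_whitelist(text: str) -> bool:
    """Return True if every character of text is in the whitelist."""
    return all(ch in WHITELIST for ch in text)


candidates = ["AB123", "ABC12", "A1234", "12345", "ABCD1"]

# The whitelist has no notion of position, so it cannot single out
# "two letters followed by three digits".
accepted = [c for c in candidates if passes_whitelist(c)]
print(accepted)
```

All five candidates pass, which is why the whitelist alone cannot isolate "AB123".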

  • I run the command: tesseract image.jpg result -l eng bazaar, and I get this message:

Please provide at least 4 concrete characters at the beginning of the pattern

Invalid user pattern \A\A\d\d\d

Tesseract Open Source OCR Engine v3.01 with Leptonica

  • image.jpg :

  • The result :

    AB123
    ABC12
    A1234
    12345
    ABCD1
    

So the result is wrong: I only wanted to capture the sequence "AB123".

Can somebody tell me why the regular expression in my user-patterns file has no effect? For the configuration, I strictly followed the bazaar tutorial.


Answer 1:


Try using this pattern with quantifiers instead.

[a-zA-Z]{2}\d{3}

This should match exactly 2 alphabetic characters followed by 3 digits.

The reason you were matching everything before is that \w matches any alphanumeric character.
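
As a rough sketch of how the quantifier pattern behaves, the Python snippet below applies it to the recognized text after OCR has run. This is an assumption: whether Tesseract's user-patterns file accepts this regular-expression syntax is not confirmed here, and the helper name `filter_matches` is hypothetical. The input lines are the ones from the question's result.

```python
import re

# The pattern from the answer: 2 letters followed by 3 digits.
PATTERN = re.compile(r"[a-zA-Z]{2}\d{3}")


def filter_matches(lines):
    """Keep only lines that consist entirely of 2 letters then 3 digits."""
    cleaned = (line.strip() for line in lines)
    # fullmatch requires the whole line to match, not just a substring.
    return [line for line in cleaned if PATTERN.fullmatch(line)]


# Lines taken from the OCR result shown in the question.
ocr_lines = ["AB123", "ABC12", "A1234", "12345", "ABCD1"]
matches = filter_matches(ocr_lines)
print(matches)
```

Used this way, only "AB123" survives the filter. Using `fullmatch` rather than `search` is a deliberate choice so that a longer line containing a matching substring is rejected.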



Source: https://stackoverflow.com/questions/31874393/tesseract-ocr-force-pattern
