tesseract

Forcing Tesseract to match pattern (four digits in a row)

放荡痞女 submitted on 2019-12-06 04:38:37
I'm trying to get Tesseract (using the Tess4J wrapper) to match only a specific pattern. The pattern is four digits in a row, which I think would be \d\d\d\d. Here is a VERY small subset of the image I'm feeding Tesseract (the floorplans are restricted, so I'm cautious about posting much more of it): http://mike724.com/view/a06771 I'm using the following Java code:
    File imageFile = new File("/<redacted>/file.pdf");
    Tesseract instance = Tesseract.getInstance();
    instance.setTessVariable("load_system_dawg", "F");
    instance.setTessVariable("load_freq_dawg", "F");
    instance.setTessVariable("user_words
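
As an illustration of the four-digit requirement, here is a minimal Python sketch that post-filters the recognized text instead of constraining the engine. It is not the asker's Tess4J code: the function name `four_digit_runs` and the sample string are made up for the example, and the sketch assumes the OCR output is already available as a plain string.

```python
import re

# Exactly four digits that are not part of a longer digit run.
FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")


def four_digit_runs(ocr_text):
    """Return every standalone four-digit run found in the OCR output."""
    return FOUR_DIGITS.findall(ocr_text)


# Hypothetical OCR output, not taken from the floorplan image.
sample = "Room 1024 next to 20481 and 3077A"
print(four_digit_runs(sample))
```

The lookarounds reject digit runs longer than four, so `20481` is skipped while `1024` and the `3077` in `3077A` are kept. Filtering afterwards does not stop the engine from misreading a digit in the first place.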

Symbol lookup error while using Tesseract

十年热恋 submitted on 2019-12-06 03:40:58
I've been using Tesseract 4 for a project for more than two months now. (This means that it has been running on input images for more than two months.) The error I'm shown is:
    multiprocess.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/home/cse/.local/lib/python3.5/site-packages/multiprocess/pool.py", line 119, in worker
        result = (True, func(*args, **kwds))
      File "/home/cse/.local/lib/python3.5/site-packages/multiprocess/pool.py", line 44, in mapstar
        return list(map(*args))
      File "/home/cse/.local/lib/python3.5/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda

“language_model_penalty_non_dict_word” has no effect in tesseract 3.01

流过昼夜 submitted on 2019-12-06 03:38:44
Question: I'm setting language_model_penalty_non_dict_word through a config file for Tesseract 3.01, but its value doesn't have any effect. I've tried multiple images and multiple values, but the output for each image is always the same. Another user has noticed the same in a comment on another question. Edit: After looking inside the source, the variable language_model_penalty_non_dict_word is used only inside the function float LanguageModel::ComputeAdjustedPathCost. However, this

Android OCR detecting digits only using popular Tesseract fork tess-two

£可爱£侵袭症+ submitted on 2019-12-06 03:15:52
Question: I'm using tess-two, the popular Tesseract OCR fork for Android: https://github.com/rmtheis/tess-two. I integrated all the stuff and it works, but I need to detect only digits. My code for now is:
    TessBaseAPI baseApi = new TessBaseAPI();
    baseApi.init(pathToLngFile, langName);
    baseApi.setImage(bitmap);
    String recognizedText = baseApi.getUTF8Text();
    baseApi.end();
    doSomething(recognizedText);
From here https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits? I'm using
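
For comparison with the Android snippet, here is a rough Python sketch of the digits-only idea using the command-line tool rather than tess-two. It only builds the argument list for the `tesseract` executable with `-c tessedit_char_whitelist=0123456789` and adds a fallback filter on the returned text. The function names are hypothetical, nothing here calls the tess-two API, and whether the whitelist is honored may depend on the Tesseract version and engine mode.

```python
import re

DIGITS = "0123456789"


def digits_only_command(image_path, output_base):
    """Build a tesseract CLI invocation restricted to digits.

    -c VAR=VALUE sets a Tesseract variable for this run.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_char_whitelist=" + DIGITS,
    ]


def keep_digits(recognized_text):
    """Fallback: keep only the digit runs from whatever the engine returned."""
    return re.findall(r"\d+", recognized_text)


print(digits_only_command("scan.png", "scan"))
print(keep_digits("Tel: 555-0199 x"))
```

The fallback filter is useful when the engine still emits letters or punctuation despite the whitelist.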

Invoking via command line versus JNI

断了今生、忘了曾经 submitted on 2019-12-06 01:13:31
I need to invoke Tesseract OCR (it's an open source C++ library that does Optical Character Recognition) from a Java application server. Right now it's easy enough to run the executable using Runtime.exec(). The basic logic would be:
1- Save the image that is currently held in memory to a file (a .tif).
2- Pass the image file name to the tesseract command-line program.
3- Read the output text file from Java using FileReader.
How much improvement in terms of performance am I likely to get by writing a JNI wrapper for Tesseract? Unfortunately there is not an open source JNI wrapper that works in Linux. I
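
The three command-line steps described above (write the image to disk, run the executable, read the text file back) can be sketched as follows. This is a Python illustration rather than the asker's Java code. The names `ocr_via_cli`, `build_command`, `page.tif`, and the injectable `run` parameter are assumptions made for the example, and the sketch relies on the tesseract executable writing its result to `<output base>.txt`.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base):
    """Command line for: tesseract <image> <output base>."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_via_cli(image_bytes, work_dir, run=subprocess.run):
    """Round-trip an in-memory image through the tesseract executable.

    `run` is injectable so the flow can be exercised without Tesseract installed.
    """
    work_dir = Path(work_dir)
    image_path = work_dir / "page.tif"
    output_base = work_dir / "page"

    # Step 1: save the in-memory image to a file.
    image_path.write_bytes(image_bytes)
    # Step 2: pass the file name to the command-line program.
    run(build_command(image_path, output_base), check=True)
    # Step 3: read back the text file that tesseract wrote.
    return (work_dir / "page.txt").read_text(encoding="utf-8")
```

Each call here pays for two file writes, one file read, and a process start, which is the overhead an in-process binding would avoid. How much that matters relative to the recognition time itself would need to be measured.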

How to install leptonica+tesseract on Windows without Visual Studio to use in Anaconda?

自闭症网瘾萝莉.ら submitted on 2019-12-05 22:33:08
I want to perform text recognition on images and I want to use Python. I installed Anaconda. Now I want to install Tesseract, but I also need to install Leptonica. I did not find any clear instructions on how to do this on Windows. For Leptonica I do not want to install Visual Studio. So could anybody provide clear instructions on how to install Leptonica and Tesseract on Windows without Visual Studio, for use in Anaconda? Thanks. Here is a simple set of steps to get the Tesseract 3.05 dev version (as of 04/22/2016) working on both Windows 7 and Windows 8 machines: 1- install tesseract from its executable

Tesseract OCR force pattern

流过昼夜 submitted on 2019-12-05 22:17:20
Question: I want to read a specific character sequence with Tesseract, as in this post: Tesseract OCR: is it possible to force a specific pattern? I have tried the bazaar pattern matching in Tesseract with the pattern \d\d\d\A\A, and the OCR still recognizes other words that don't match. I have tried to use the "tessedit_char_whitelist" parameter, but I can't choose the position of the characters with that. I launch the command: tesseract image.jpg result -l eng bazaar And I get this message: Please provide
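
One way to enforce character positions regardless of what the engine accepts is to check each recognized word against the pattern afterwards. The Python sketch below translates a pattern such as \d\d\d\A\A into a regular expression and keeps only the words that match it in full. It assumes that \d means a digit, \A an upper-case letter, and \a a lower-case letter, which is my reading of the user-patterns syntax and should be checked against the Tesseract documentation. The function names are hypothetical.

```python
import re

# Assumed meaning of the pattern classes (verify against the Tesseract docs).
CLASS_TO_REGEX = {"d": r"\d", "A": "[A-Z]", "a": "[a-z]"}


def pattern_to_regex(pattern):
    """Translate a pattern like \\d\\d\\d\\A\\A into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            key = pattern[i + 1]
            if key not in CLASS_TO_REGEX:
                raise ValueError("unsupported class: \\" + key)
            parts.append(CLASS_TO_REGEX[key])
            i += 2
        else:
            # Any other character must match literally.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matching_words(ocr_text, pattern):
    """Keep only whitespace-separated words that match the whole pattern."""
    rx = pattern_to_regex(pattern)
    return [word for word in ocr_text.split() if rx.fullmatch(word)]


print(matching_words("123AB hello 12AB 456cd 789XY", r"\d\d\d\A\A"))
```

This rejects non-matching words after recognition. It does not steer the recognizer toward the pattern the way a patterns file is meant to.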

Building Tesseract with Android NDK

白昼怎懂夜的黑 submitted on 2019-12-05 21:28:32
I'm following this tutorial to compile this fork of Tesseract (an optical character recognition package) for Android. I'm at the step where I use Cygwin to build the NDK for the Tesseract Android project. I'm getting the following error when invoking ndk-build (from the tess-two directory): c:/android-ndk-r8b-windows/android-ndk-r8b/toolchains/arm-linux-androideabi-4.6/prebuilt/windows/bin/../lib/gcc/arm-linux-androideabi/4.6.x-google/../../../../arm-linux-androideabi/bin/ld.exe: cannot find ./obj/local/armeabi-v7a/libgnustl_static.a: Permission denied What could be causing this error? On a

CMake for Tesseract and OpenCV

那年仲夏 submitted on 2019-12-05 21:07:00
I am new to Linux programming and am trying to create an OCR application on Ubuntu 12.10 using Tesseract and OpenCV. So far I have set up Tesseract and OpenCV on Linux and followed this tutorial, in which I found it very easy to create a single CMakeLists.txt file and link OpenCV in it. Now I am trying to compile the tesseract-ocr library with this code. As far as I know, I have not linked tesseract-ocr to my code, and that's why I am getting errors. What I am looking for is a way to link both Tesseract and OpenCV using CMake in one file, if that is possible. A tutorial would be

Best method to train Tesseract 3.02

扶醉桌前 submitted on 2019-12-05 20:59:10
I'm wondering what the best method is to train Tesseract (what kind of text, TIFF, and so on) for a particular kind of document with these characteristics:
- the structure and main text of the documents are always the same
- the only things that change are 5 alphanumeric codes (THESE ARE THE REALLY IMPORTANT THINGS TO DETECT!)
- some of these codes are bold
At the moment I use the standard trained data, detect the entire text, and extract the codes with some regular expressions. It's okay, but I sometimes get errors, for example: 0 / O and L / I / 1. Does anyone know some "tricks" to improve precision?
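
Independently of training, the 0 / O and L / I / 1 confusions can sometimes be repaired after recognition when the layout of each code is known. The Python sketch below assumes a hypothetical layout string in which `A` marks a letter position and `D` marks a digit position. The layout `AADDDD` and the function name `normalize_code` are invented for the example, since the question does not give the real code format.

```python
# Characters commonly misread as digits, mapped to the digit they resemble.
TO_DIGIT = str.maketrans({"O": "0", "I": "1", "L": "1", "l": "1"})
# In a letter position only 0 -> O is unambiguous; 1 could be I or L, so it is left alone.
TO_LETTER = str.maketrans({"0": "O"})


def normalize_code(raw, layout):
    """Repair look-alike characters using a known per-position layout.

    layout uses 'A' for a letter position and 'D' for a digit position;
    any other layout character leaves the input character untouched.
    """
    if len(raw) != len(layout):
        raise ValueError("code length does not match layout")
    fixed = []
    for ch, kind in zip(raw, layout):
        if kind == "D":
            fixed.append(ch.translate(TO_DIGIT))
        elif kind == "A":
            fixed.append(ch.translate(TO_LETTER))
        else:
            fixed.append(ch)
    return "".join(fixed)


# Hypothetical code: two letters followed by four digits.
print(normalize_code("A0O1L2", "AADDDD"))
```

This only helps where each position is known to hold a letter or a digit. A `1` read in a letter position stays ambiguous between `I` and `L` and still needs another signal, such as a checksum or a list of valid codes.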