tesseract

How to speed up tesseract OCR

╄→尐↘猪︶ㄣ submitted on 2019-12-08 01:42:45
Question: I'm trying to OCR a lot of documents (in the range of 300k+ a day). At the moment I'm using the Tesseract wrapper for .NET, and the quality is fine, but the speed is not good enough. With 20 tasks in parallel, each scanning half a page from the same PDF, the average time is 2,546 seconds per scan. The code I'm using: using (var engine = new TesseractEngine(Tessdata, "eng", EngineMode.TesseractOnly)) { Page page; page = engine.Process(image, srcRect); var text = page.GetText(); return Task

“ValueError: cannot filter palette images” during Pytesseract Conversion

☆樱花仙子☆ submitted on 2019-12-07 19:31:32
Question: I'm having trouble with this error from the following Pytesseract code (Python 3.6.1, Mac OS X). import pytesseract import requests from PIL import Image from PIL import ImageFilter from io import StringIO, BytesIO def process_image(url): image = _get_image(url) image.filter(ImageFilter.SHARPEN) return pytesseract.image_to_string(image) def _get_image(url): r = requests.get(url) s = BytesIO(r.content) img = Image.open(s) return img process_image("https://www.prepressure.com/images
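
A minimal sketch of one common workaround, assuming the downloaded image opens in palette mode ("P"), which Pillow's built-in filters reject with this ValueError. The helper name sharpen_for_ocr is hypothetical. It converts palette images to RGB and keeps the image returned by filter(), since filter() returns a new image rather than modifying the original in place.

```python
from PIL import Image, ImageFilter


def sharpen_for_ocr(image):
    """Return a sharpened copy of `image` (hypothetical helper, not part of pytesseract)."""
    # Pillow's built-in filters raise ValueError on palette ("P") images,
    # so convert those to RGB first.
    if image.mode == "P":
        image = image.convert("RGB")
    # filter() returns a new image; the original is left unchanged.
    return image.filter(ImageFilter.SHARPEN)


if __name__ == "__main__":
    # Stand-in for a downloaded GIF or 8-bit PNG: a small palette-mode image.
    palette_image = Image.new("P", (8, 8))
    sharpened = sharpen_for_ocr(palette_image)
    print(sharpened.mode, sharpened.size)
```

The returned image can then be passed to pytesseract.image_to_string() as in the question.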

Add any traineddata file in tesseract and use in IOS

寵の児 submitted on 2019-12-07 17:50:45
Question: I am able to compile the English version that is already in the Tesseract sample, but I am not able to add another language such as ara.traineddata. I am doing this: Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"ara+eng"]; It recognizes English, but for ara it gives the error: Error opening data file /Users/harshthakur/Library/Application Support/iPhone Simulator/7.0/Applications/3B0A1909-E1BA-45E9-99A0-FDEAB2CFF4E0/Documents/tessdata/ara.traineddata Please

tess-two can't find libpng.so

怎甘沉沦 submitted on 2019-12-07 16:44:42
Question: I followed the build instructions for tess-two on GitHub. I built tess-two with the NDK successfully and imported the library. I am trying to run the test application provided in the same repository, but whenever the application starts it throws the following exception. The error occurs as soon as new TessBaseAPI(); is called. dlopen("/data/app-lib/com.datumdroid.android.ocr.simple-2/liblept.so") failed: Cannot load library: soinfo_link_image(linker.cpp:1635): could not load library "libpng.so"

Invoking via command line versus JNI

只谈情不闲聊 submitted on 2019-12-07 14:47:56
Question: I need to invoke Tesseract OCR (an open-source C++ library that does optical character recognition) from a Java application server. Right now it's easy enough to run the executable using Runtime.exec(). The basic logic would be: (1) save the image that is currently held in memory to a file (a .tif); (2) pass the image file name to the tesseract command-line program; (3) read the output text file from Java using FileReader. How much improvement in terms of performance am I likely to get by writing a
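
The three steps above can be sketched as follows, in Python rather than Java for brevity. The sketch assumes a tesseract executable on the PATH and the usual CLI convention that `tesseract <image> <outputbase>` writes `<outputbase>.txt`. The function name and the `runner` parameter (which lets the process call be swapped out) are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def ocr_via_cli(image_bytes, runner=subprocess.run):
    """Run the tesseract CLI on in-memory image bytes and return the text.

    Hypothetical helper: `runner` defaults to subprocess.run and is only a
    parameter so the external call can be replaced in tests.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Step 1: save the in-memory image to a file (a .tif).
        image_path = Path(workdir) / "page.tif"
        image_path.write_bytes(image_bytes)

        # Step 2: pass the file name to the tesseract command-line program.
        # Assumption: tesseract appends ".txt" to the output base name.
        output_base = Path(workdir) / "page"
        runner(["tesseract", str(image_path), str(output_base)], check=True)

        # Step 3: read the output text file.
        return (Path(workdir) / "page.txt").read_text(encoding="utf-8")
```

Each call in this form starts a new process and goes through temporary files, which is the per-call overhead an in-process binding would avoid.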

Improve horizontal line detection in .pdf image with OpenCV

被刻印的时光 ゝ submitted on 2019-12-07 11:17:15
Question: I have .pdf files that have been converted to .jpg images for this project. My goal is to identify the blanks (e.g. ____________) that you would generally find in a .pdf form, indicating a space for the user to sign or fill in some kind of information. I have been using edge detection with the cv2.Canny() and cv2.HoughLinesP() functions. This works fairly well, but quite a few false positives appear seemingly out of nowhere. When I look at the 'edges' file it shows a bunch
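
One way to cut down such false positives is to post-filter the segments returned by cv2.HoughLinesP(), keeping only long, nearly horizontal ones. The sketch below is plain Python over (x1, y1, x2, y2) tuples. The function name and the default thresholds are illustrative assumptions, not values from the question.

```python
import math


def keep_horizontal(segments, max_angle_deg=2.0, min_length=50.0):
    """Keep segments that are nearly horizontal and at least `min_length` px long.

    `segments` is an iterable of (x1, y1, x2, y2) endpoints, the same layout
    as each segment returned by cv2.HoughLinesP(). Thresholds are illustrative.
    """
    kept = []
    for x1, y1, x2, y2 in segments:
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length < min_length:
            continue  # too short to be a fill-in blank
        # Angle from the horizontal axis, folded into [0, 90] degrees.
        angle = math.degrees(math.atan2(abs(dy), abs(dx)))
        if angle <= max_angle_deg:
            kept.append((x1, y1, x2, y2))
    return kept
```

With OpenCV's output, a call such as keep_horizontal(lines.reshape(-1, 4)) should work once `lines` has been checked for None, since HoughLinesP returns an array of shape (N, 1, 4).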

Free hand character recognition in android

核能气质少年 submitted on 2019-12-07 10:57:10
Question: We are working on an Android application that involves freehand character recognition. The application requires the student to draw a letter of the alphabet freehand on the Android screen; the application then processes the drawn image and returns the accuracy of the written letter. We are considering two options: a. Using Tesseract. b. Using our own algorithm, on which we are still working. Problems: a. Tesseract is not helping at all in recognizing freehand characters. Any pointers on how to

How to integrate Tesseract OCR Library to a C++ program

和自甴很熟 submitted on 2019-12-07 10:29:14
Question: I am trying to use the Tesseract OCR library to create a program that reads pictures of elevator floor numbers. I haven't found any example of how to include the Tesseract library in a C++ file. Something like: #include "tesseract.h" I am using Tesseract v3.00 on Ubuntu 10.10. Answer 1: The PlatformStatus page has some comments on how to install it. It has dependencies (Leptonica) that also need to be installed. Another solution, also linked from the above discussion, has similar details for

JAVA Tess4j doOCR() not working, Exception “Invalid memory access”

可紊 submitted on 2019-12-07 09:30:00
Question: I'm working on a dynamic web project in Eclipse. I made a TesseractOCR class that contains: public class TesseractOCR { public TesseractOCR() { } public String doOCR(String file) { System.setProperty("jna.library.path", "32".equals(System.getProperty("sun.arch.data.model")) ? "lib/win32-x86" : "lib/win32-x86-64"); File imageFile = new File("C:\\Users\\Sherein Dabbah\\Downloads\\ca096-d7a6d799d7a1d798d799d7a72.jpg"); Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping

how to detect orientation of a scanned document?

核能气质少年 submitted on 2019-12-07 08:45:44
Question: I'd like to detect and, if necessary, correct the orientation of a scanned document image. I am already able to deskew documents; however, a document might still be upside down and need to be rotated by 180°. Using Tesseract's layout analysis feature, it should be possible to determine a document's orientation with this code: tesseract::TessBaseAPI api; api.Init(argv[0], "eng"); api.SetImage(img); api.SetPageSegMode(tesseract::PSM_AUTO_OSD); tesseract::PageIterator* it = api
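
As a separate route from the C++ API above, the tesseract command line can report orientation as text when run in OSD-only mode (--psm 0 in Tesseract 4; pytesseract exposes this as image_to_osd()). The sketch below only parses such a report. The sample text illustrates the assumed report format and is not output from a real run, and the function name is hypothetical.

```python
import re

# Assumed shape of an OSD report; the values here are made up for illustration.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 2.17
Script: Latin
Script confidence: 1.11
"""


def rotation_from_osd(osd_text):
    """Return the 'Rotate' angle in degrees from an OSD text report, or None."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(rotation_from_osd(SAMPLE_OSD))
```

For the upside-down case in the question, a value of 180 would indicate that the page needs flipping before OCR.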