ocr

Remove Background Color or Texture Before OCR Processing

徘徊边缘 submitted on 2019-12-18 18:21:34
Question: When a typical mobile phone user takes a picture of a card-size object, some background texture is usually included in the image -- please refer to the attached samples. In certain cases, that background can degrade OCR accuracy. I am wondering whether there are solutions to remove the background (I am positive that there are), or to detect the background regions so one can just crop them off before OCR. In case of the attached images, wood tables and counter-top presenting are the

How do I improve the accuracy of the OCR text from Tesseract?

守給你的承諾、 submitted on 2019-12-18 17:54:19
Question: I created a basic app for recognizing text using the Tesseract API from Google and integrated it with my camera app. It works fine, but the only problem is the accuracy: sometimes the text is recognized as a random set of characters, and I guess the accuracy is about 50 percent. Further, when it tries to scan more than four words in an image, the app crashes. String ocrText = baseApi.getUTF8Text(); baseApi.end(); where baseApi is the object of the Tesseract API class. Do I need to use a

Including Tess4J to a Java project as library in Eclipse

余生颓废 submitted on 2019-12-18 15:49:46
Question: I have a so far empty and clean Eclipse Java project. What do I have to do to use Tess4J as a library for the web service that I want to develop? Is it even possible to use it as a library for an Android project? (That would be a big shortcut.) There is an issue regarding .tif with Android that I came across. Tess4J is a wrapper for native code, because tesseract-ocr is written in C/C++. That much I've got so far. But how do I include this wrapper in my project? I've googled a lot until I have decided

Image processing / super light OCR

淺唱寂寞╮ submitted on 2019-12-18 13:47:25
Question: I have 55,000 image files (in both JPG and TIFF format) which are pictures from a book. The structure of each page is this: some text --- (horizontal line) --- a number some text --- (horizontal line) --- another number some text. There can be from zero to 4 horizontal lines on any given page. I need to find what the number is, just below the horizontal line. BUT, numbers strictly follow each other, starting at one on page one, so in order to find the number, I don't need to read it: I could
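
A minimal sketch of one way to approach the question above, under stated assumptions: each page has already been binarized into rows of 0 (background) and 1 (ink), a horizontal line is any run of consecutive rows that are mostly ink, and the numbers are assigned purely by counting lines page by page. The function names, the `min_fill` threshold, and the list-of-lists image format are all hypothetical and not taken from the original question.

```python
def find_horizontal_lines(image, min_fill=0.8):
    """Return (first_row, last_row) spans of consecutive rows that are mostly ink.

    `image` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    `min_fill` is the assumed fraction of ink pixels that makes a row a "line" row.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        fill = sum(row) / len(row) if row else 0.0
        if fill >= min_fill:
            if start is None:
                start = y              # a new line begins on this row
        elif start is not None:
            lines.append((start, y - 1))  # the line ended on the previous row
            start = None
    if start is not None:              # a line that touches the bottom edge
        lines.append((start, len(image) - 1))
    return lines


def number_pages(line_counts, first_number=1):
    """Assign consecutive numbers to the lines of each page, in page order."""
    numbers = []
    next_number = first_number
    for count in line_counts:
        numbers.append(list(range(next_number, next_number + count)))
        next_number += count
    return numbers
```

For example, `number_pages([2, 0, 1])` gives `[[1, 2], [], [3]]`: the third page's only line is followed by number 3 without any digit ever being read. Real scans would likely need deskewing and a tolerance for broken lines before a row projection like this is reliable.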

Getting UnsatisfiedLinkError: no jnilept in java.library.path when I create TessBaseAPI

折月煮酒 submitted on 2019-12-18 13:17:59
Question: I am new to JavaCPP and tesseract-ocr. I have been stuck on one issue for a couple of hours. I am getting UnsatisfiedLinkError: no jnilept in java.library.path when I create TessBaseAPI. Below is the relevant piece of my code. public static void tesseractForPdf(String filePath) throws Exception { BytePointer outText; TessBaseAPI api = new TessBaseAPI();//getting the UnsatisfiedLinkError exception here. // Initialize tesseract-ocr with English, without specifying tessdata path if (api.Init(".", "ENG") != 0)

Image processing for OCR with leptonica (inverse color text)

旧街凉风 submitted on 2019-12-18 10:45:19
Question: I am trying to process the following image with leptonica to extract text with tesseract. Original Image: Tesseract on the original image yields this: i s l D2J1FiiE-l191x1iitmwii9 uhiaiislz-2 Q ~37 Bottom linez With a little time! you can learn social media technology using free online resources- And if you donity youlll be at a significant disadvantage to other HOn-pFOiiTS- Not great, especially the top background. So using leptonica I use a background removal algorithm (blur, difference,

Image processing and extraction of characters

。_饼干妹妹 submitted on 2019-12-18 10:17:00
Question: I'm trying to figure out what technologies I would need to process images for characters. Specifically, in this example, I need to extract the hashtag that is circled. You can see it here: Any implementations would be of great assistance. Answer 1: It is possible to solve this problem with OpenCV + Tesseract, though I think there might be easier ways. OpenCV is an open source library used to build computer vision applications, and Tesseract is an open source OCR engine. Before we start, let me

How to make tesseract to recognize only numbers, when they are mixed with letters?

吃可爱长大的小学妹 submitted on 2019-12-18 10:08:05
Question: I want to use tesseract to recognize only numbers. The problem is that I have a mixture of numbers and letters, and when I use SetVariable("tessedit_char_whitelist", "0123456789"), for every symbol tesseract returns a wrong digit. Can I set a threshold value so that tesseract omits the symbols with low resemblance? NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0. Answer 1: Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for
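
The thresholding idea in the question above can be sketched independently of any OCR binding. This is only an illustration, assuming you have already obtained a list of (symbol, confidence) pairs from your OCR engine, with confidences on a 0 to 100 scale. The function name, the input format, and the default threshold of 80 are hypothetical and are not part of Tesseract's API.

```python
DIGITS = "0123456789"


def keep_confident_digits(symbols, min_conf=80.0):
    """Join the digits whose confidence reaches `min_conf`, dropping the rest.

    `symbols` is an iterable of (character, confidence) pairs; how they are
    produced is left to the OCR binding in use (assumption of this sketch).
    """
    return "".join(
        ch for ch, conf in symbols
        if ch in DIGITS and conf >= min_conf
    )


# Hypothetical recognition result: letters forced into digits tend to score low.
recognized = [("4", 96.0), ("7", 41.5), ("2", 88.0), ("1", 79.9)]
filtered = keep_confident_digits(recognized)
```

A suitable threshold depends on the images and would need to be tuned against samples with known content.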

Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

China☆狼群 submitted on 2019-12-18 04:12:58
Question: The Tesseract FAQ says: How can I get the coordinates and confidence of each character? There are two options. If you would rather not get into programming, you can use Tesseract's hocr output format (read the Tesseract manual page for details). But when I created a sample hOCR output (it's an .html file), the bounding boxes and confidence levels were only available at the word level. Am I missing something here? I've added the sample input/output as illustration (the input
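
To make the word-level observation concrete, here is a small sketch that reads word bounding boxes and confidences out of hOCR markup using only the Python standard library. It assumes the usual hOCR layout in which each word is a span with class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1" and "x_wconf N", and it assumes word spans contain no nested spans. The sample markup is made up for illustration and is not the asker's file.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bbox and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes no spans are nested inside a word span.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up sample in the usual hOCR shape.
SAMPLE = """<span class='ocr_line' title='bbox 10 10 200 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 12 80 38; x_wconf 93'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 90 12 200 38; x_wconf 71'>world</span>
</span>"""

words = parse_hocr_words(SAMPLE)
```

Each entry in `words` carries one box and one confidence per word, which matches what the asker reports seeing; nothing in this markup describes individual characters.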

Batch OCR Program for PDFs [closed]

喜欢而已 submitted on 2019-12-17 17:34:36
Question: This has been asked before, but I don't really know if the answers help me. Here is my problem: I have a bunch (10,000 or so) of PDF files. Some were text files that were saved using Adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they