ocr

Emgu CV: improper skew detection on PAN card images in C#

夙愿已清 submitted on 2020-06-29 04:00:03
Question: I have three images of a PAN card for testing image skew detection using Emgu CV and C#. The first image (top) is detected as 180 degrees, which is correct. The second image (middle) is detected as 90 degrees but should be detected as 180 degrees. The third image is detected as 180 degrees but should be detected as 90 degrees. One observation I want to share: when I crop the unwanted parts from the top and bottom of the PAN card image using Paintbrush, the code below gives me the expected result. Now i

Tesseract OCR acts weird when scaling up the image size. How to know which scale factor is best for a particular type of image?

混江龙づ霸主 submitted on 2020-06-27 18:35:11
Question: I have this 006.jpg image and I tried the following Python code. I downloaded "eng" from tessdata_best and renamed it to "eng_best". img = cv2.imread(file_path) lang = "eng_best" for img_scale_factor in range (1,8): print(file_path, img_scale_factor) img = cv2.resize(img,None,fx=img_scale_factor,fy=img_scale_factor) hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension="hocr", lang=lang, config="--dpi 1") file_name = '{0:03d}_jpg_{1}_x{3}.{2}'.format(6, lang, "hocr", img_scale_factor) with
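
One thing that appears worth checking in the snippet above: `img` is overwritten on every pass of the loop, so each resize is applied to the already enlarged image and the factors compound instead of running from 1 to 7 against the original. The following is a minimal pure-Python sketch of that arithmetic only. The helper names `cumulative_scales` and `independent_scales` are hypothetical, and nothing here calls OpenCV or Tesseract.

```python
def cumulative_scales(factors):
    """Effective scale when each resize is applied to the previous result."""
    scales = []
    current = 1
    for factor in factors:
        current *= factor  # the image being resized was already scaled
        scales.append(current)
    return scales


def independent_scales(factors):
    """Effective scale when each resize starts from the untouched original."""
    return list(factors)


factors = range(1, 8)
print(cumulative_scales(factors))   # [1, 2, 6, 24, 120, 720, 5040]
print(independent_scales(factors))  # [1, 2, 3, 4, 5, 6, 7]
```

If the intent is to compare scale factors one at a time, resizing a fresh copy of the original image inside the loop (rather than reassigning `img`) would keep the factors independent.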

I want to sort the words extracted from an image in order of their occurrence using contour detection

五迷三道 submitted on 2020-06-27 06:21:15
Question: I am building an OCR pipeline using contour detection. I have extracted words and drawn bounding boxes, but when I crop the individual words, they are not in sorted order. I have tried the sorting methods mentioned in this link to sort the contours; they work best on objects, but in my case I want the order to be exact. Sometimes sorting is not the best solution: it changes the pattern of words, because different words on the same line have bounding boxes of different sizes and values of 'x
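
The ordering problem described above (boxes of different heights on the same text line) is often handled by first grouping boxes into lines by their vertical centers and only then sorting each line by x. Below is a minimal sketch of that idea, not the asker's code. It assumes boxes are `(x, y, w, h)` tuples such as those returned by a bounding-rectangle call; the function name `sort_reading_order` and the `line_tolerance` parameter are hypothetical.

```python
def sort_reading_order(boxes, line_tolerance=0.5):
    """Sort (x, y, w, h) boxes top-to-bottom by line, then left-to-right.

    A box joins an existing line when its vertical center lies within
    line_tolerance * (tallest box height in that line) of the line's center.
    """
    lines = []
    # Visit boxes from top to bottom by vertical center.
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        x, y, w, h = box
        center_y = y + h / 2
        for line in lines:
            if abs(center_y - line["center_y"]) <= line_tolerance * line["height"]:
                line["boxes"].append(box)
                count = len(line["boxes"])
                # Keep a running mean of the line's vertical center.
                line["center_y"] += (center_y - line["center_y"]) / count
                line["height"] = max(line["height"], h)
                break
        else:
            lines.append({"center_y": center_y, "height": h, "boxes": [box]})

    lines.sort(key=lambda line: line["center_y"])
    ordered = []
    for line in lines:
        ordered.extend(sorted(line["boxes"], key=lambda b: b[0]))
    return ordered


# Made-up boxes: three words on the first line, two on the second.
boxes = [(120, 12, 40, 20), (10, 10, 50, 30), (60, 62, 30, 18),
         (10, 60, 40, 25), (70, 15, 40, 12)]
print(sort_reading_order(boxes))
```

The tolerance is a judgment call: too small splits one text line into several, too large merges neighbouring lines, so it likely needs tuning per document type.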

I want to sort the words extracted from an image in order of their occurrence using contour detection

早过忘川 submitted on 2020-06-27 06:20:36
Question: I am building an OCR pipeline using contour detection. I have extracted words and drawn bounding boxes, but when I crop the individual words, they are not in sorted order. I have tried the sorting methods mentioned in this link to sort the contours; they work best on objects, but in my case I want the order to be exact. Sometimes sorting is not the best solution: it changes the pattern of words, because different words on the same line have bounding boxes of different sizes and values of 'x

AWS-Textract-Key-Value-Pair Java - thread “main” java.lang.NullPointerException

天大地大妈咪最大 submitted on 2020-06-22 04:19:26
Question: I am using AWS Textract in a Java Spring Boot project. I have set up the AWS CLI and have the SDK as a Maven dependency. I have written Java code, converted from C#, to extract the key and value pairs, and I receive the following error after successfully extracting some words " AGENCYCUSTOMERID:FEIN(ifapplicable)MARITALSTATUS/CIVILUNION(ifapplicable)INSUREDLOCATIONCODEBUSPRIMARYE-MAILADDRESS:FEIN(ifapplicable)LINEOFBUSINESSCELLMARITALSTATUScivilUNION(ifapplicable)CELLCELLHOME ":

Tesseract OCR: read horizontally rather than vertically in C#

只谈情不闲聊 submitted on 2020-06-13 08:57:44
Question: We have a C# .NET app that uses Tesseract to do Optical Character Recognition (OCR) on .tiff files. Here's an example: We then output the data to a text file. However, Tesseract is reading the data vertically. In my example image, it reads the TIFF as two columns of data, and the data is output from Tesseract like this: TYPE: DATE: Address: City: State: Owner: Owner Type: Acreage: Mortgage: 12345 2017-04-06 100 Main St. Some City Some State John
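
Tesseract's page segmentation mode (the `--psm` option) is the setting usually tried first for column-wise reading like this. As an alternative, the output can be re-paired after the fact. The sketch below is in Python rather than the asker's C#, and it rests on two assumptions that are not confirmed by the question: that each label and each value arrives on its own line, and that every label ends with a colon. The function name `pair_columns` and the sample lines are hypothetical.

```python
def pair_columns(lines):
    """Pair 'Label:' lines with value lines, in order of appearance.

    Assumes column-wise OCR output: all labels (ending in ':') and all
    values are present, one per line, and each column keeps its order.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    labels = [line for line in cleaned if line.endswith(":")]
    values = [line for line in cleaned if not line.endswith(":")]
    if len(labels) != len(values):
        raise ValueError(
            f"cannot pair {len(labels)} labels with {len(values)} values"
        )
    return [f"{label} {value}" for label, value in zip(labels, values)]


# Made-up sample shaped like the output quoted in the question.
sample = [
    "TYPE:", "DATE:", "Address:", "City:", "State:",
    "12345", "2017-04-06", "100 Main St.", "Some City", "Some State",
]
for row in pair_columns(sample):
    print(row)
```

This breaks down as soon as a value is missing or itself ends in a colon, which is why the length check raises instead of guessing.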

Tesseract OCR: read horizontally rather than vertically in C#

青春壹個敷衍的年華 submitted on 2020-06-13 08:56:01
Question: We have a C# .NET app that uses Tesseract to do Optical Character Recognition (OCR) on .tiff files. Here's an example: We then output the data to a text file. However, Tesseract is reading the data vertically. In my example image, it reads the TIFF as two columns of data, and the data is output from Tesseract like this: TYPE: DATE: Address: City: State: Owner: Owner Type: Acreage: Mortgage: 12345 2017-04-06 100 Main St. Some City Some State John

How to fetch info in a structured format with Tesseract OCR in Python?

孤街浪徒 submitted on 2020-06-13 07:00:05
Question: I am using Ubuntu. Here is my image, which I got from the internet. My goal is to get the data as it is formatted in the image and dump it into a text file (positions have to be maintained, with 95-97% accuracy). I am working with tesseract-ocr; almost the same question is here. My code: import cv2 import pytesseract from pytesseract import Output import numpy as np img = cv2.imread("/demo.jpg") d1 = pytesseract.image_to_data(img) print(d1) It gives me completely wrong output compared to what I am expecting. In
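
One way to keep word positions when writing plain text is to place each recognized word at a column derived from its `left` pixel coordinate. The sketch below assumes a dictionary shaped like the one `pytesseract.image_to_data` returns with `output_type=Output.DICT` (parallel lists under keys such as `text`, `left`, `top`, `block_num`, `par_num`, `line_num`). To stay runnable without Tesseract it uses a made-up sample dictionary; `render_layout` and `char_width` are hypothetical names.

```python
from collections import defaultdict


def render_layout(data, char_width=10):
    """Render OCR words as text, padding each word out to left // char_width.

    Words are grouped by (block_num, par_num, line_num); lines are emitted
    top to bottom. Side-by-side blocks are NOT merged into one row here.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip the empty entries used for page/block/line rows
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["top"][i], text))

    rendered = []
    for words in sorted(lines.values(), key=lambda ws: min(w[1] for w in ws)):
        row = ""
        for left, _top, text in sorted(words):
            column = left // char_width
            if row and len(row) >= column:
                row += " "  # target column already passed: keep one space
            else:
                row = row.ljust(column)
            row += text
        rendered.append(row)
    return "\n".join(rendered)


# Made-up sample: a two-column, two-row table.
sample = {
    "text": ["", "Name", "Age", "Alice", "30"],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
    "left": [0, 0, 200, 0, 200],
    "top": [0, 10, 10, 40, 40],
}
print(render_layout(sample))
```

The `char_width` value (pixels per output character) is an assumption that depends on font size and scan resolution, so it would need to be estimated for each image.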

How to extract data from image that contains tabular data?

可紊 submitted on 2020-06-11 05:22:32
Question: I am using pytesseract, Pillow, and cv2 to OCR an image and get the text present in it. Since my input is a scanned PDF document, I first converted it to an image (JPEG) and then tried extracting the text. I am only halfway there. The input is a table, and the titles are not being extracted because they have a black background. I also tried getStructuringElement but was unable to figure out a way. Here is what I have done so far: import cv2 import os import numpy as np import

How to extract data from image that contains tabular data?

≯℡__Kan透↙ submitted on 2020-06-11 05:22:13
Question: I am using pytesseract, Pillow, and cv2 to OCR an image and get the text present in it. Since my input is a scanned PDF document, I first converted it to an image (JPEG) and then tried extracting the text. I am only halfway there. The input is a table, and the titles are not being extracted because they have a black background. I also tried getStructuringElement but was unable to figure out a way. Here is what I have done so far: import cv2 import os import numpy as np import