Python, text detection OCR

落爺英雄遲暮 提交于 2019-12-23 17:01:11

问题


I am trying to extract data from a scanned form. The form has a standard format similar to the one shown in the image below:

I have tried using pytesseract (tesseract OCR) to detect the image's text and it has done a decent job at finding the text and converting the image to text. However it essentially just gives me all the detected text without keeping the format of the data.

I would like to be able to do something like the below:

Find a particular piece of text and then find the associated data below or beside it. Similar to this question using opencv Detect text region in image using Opencv

Is there a way that I can essentially do the following:

  1. Either find all text boxes on the form, perform OCR on each box and see which one is the closest match to the "witnesess:" text, then find the sections immediately below it and perform separate OCR on those.
  2. Or if the form is standard and I know the approximate location of the "witness" text section can I specify its general location in opencv and then just extract the below text and perform OCR on it.

EDIT: I have tried the below code to try to detect specific regions of the text. However it is not specifically identifying the text just all regions.

import cv2

img = cv2.imread('t2.jpg')
mser = cv2.MSER_create()

img = cv2.resize(img, (img.shape[1]*2, img.shape[0]*2))   
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
vis = img.copy()

regions = mser.detectRegions(gray)
hulls = [cv2.convexHull(p.reshape(-1, 1, 2)) for p in regions[0]]
cv2.polylines(vis, hulls, 1, (0,255,0)) 

cv2.imshow('img', vis)

Here is the result:


回答1:


I think you have the answer already in your own post. I did recently something similar and this is how I did it:

//id_image was loaded with cv2.imread
temp_image = id_image[start_y:end_y,start_x:end_x]
img = Image.fromarray(temp_image)
text = pytesseract.image_to_string(img, config="-psm 7")

So basically, if your format is predefined, you just need to know the location of the fields that you want the text of (which you already know), crop it, and then apply the ocr (tesseract) extraction.

In this case you need import pytesseract, PIL, cv2, numpy.



来源:https://stackoverflow.com/questions/45685911/python-text-detection-ocr

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!