Select part of text that was extracted using the Tesseract OCR

Question

I'm using the latest Tesseract OCR engine in R to extract text from a couple of images. It works pretty well and I'm happy with the results. The problem is that I don't want the whole text, just some part, but I don't know how to extract it.

Code is this:

library("tesseract")
library("pdftools")
library("magick")

mypdfFile <- "C:/Users/.../fileName.pdf"

# Render page 1 to PNG; pdf_convert() returns the name of the file it wrote
mypngFile <- pdf_convert(mypdfFile, format = "png", pages = 1, dpi = 600)

myImage <- image_read(mypngFile)

# The text is in Spanish
textFile <- ocr(myImage, engine = tesseract("spa"), HOCR = FALSE)

cat(textFile)

Now, the end result looks like this

bla bla bla bla bla bla 
bla text that I want to 
extract bla bla bla bla 
bla bla bla bla bla bla

How can I get the text that I want to extract and only that?

I tried cropping the image before calling ocr(), but cropping to exactly that part isn't feasible or accurate enough. ocr() returns plain text.

Full example below

The image (originally a pdf file) is an electricity bill. I can't provide it in full due to privacy issues, but it looks like this sample image. Under NOMBRE Y DIRECCION (name and address), there should be two lines (one with the name and the other with the address) followed by "GALEANA CENTRO LERDO. C.P. " (the name of the city) and "35150 LERDO,DGO." (zip code and state). My code looks like this

myImage <- image_read("sampleImage.png")

# "new dimensions" is a placeholder: crop the right half and some from the top
myImage <- image_crop(myImage, new dimensions)

textFile <- ocr(myImage, engine = tesseract("spa"), HOCR = FALSE)

cat(textFile)

I get

Nombre y Domicilio
NAME REDACTED 
ADDRESS REDACTED
GALEANA CENTRO LERDO. C.P.
35150 LERDO, DGO.
Cuenta E Tarifa
30DC27B011164660 General < 25kW 02
AE A MA E
Num. de Lectura Lectura Mult. C
Medidor actual anterior
BD6687 40994 40539 1 ¿
Apoyo gubernamental

I just want to extract from this everything between "NAME REDACTED" and "35150 LERDO, DGO." inclusive.

Answer 1:

You could either crop the image first, if you know where your text is, or restrict what Tesseract looks for, for example with a character whitelist (see here).
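Both options can be sketched roughly as follows. This is only an illustration: crop_geometry() and ocr_region() are hypothetical helper names, the pixel values in the usage comment are made up, and the sketch assumes the magick and tesseract packages and the "spa" language data are installed. Depending on the Tesseract version, the whitelist option may be ignored by the default LSTM engine.

```r
# Build a magick geometry string "WIDTHxHEIGHT+X+Y" (all values in pixels).
crop_geometry <- function(width, height, x_off = 0, y_off = 0) {
  sprintf("%dx%d+%d+%d", as.integer(width), as.integer(height),
          as.integer(x_off), as.integer(y_off))
}

# OCR a single region of an image, optionally restricting the character set.
# Nothing from magick or tesseract is loaded until this function is called.
ocr_region <- function(path, geometry, whitelist = NULL, language = "spa") {
  img <- magick::image_crop(magick::image_read(path), geometry)
  opts <- if (is.null(whitelist)) NULL else list(tessedit_char_whitelist = whitelist)
  engine <- tesseract::tesseract(language, options = opts)
  tesseract::ocr(img, engine = engine)
}

# Hypothetical usage: a 2550x1200 px region starting 600 px from the top.
# ocr_region("sampleImage.png", crop_geometry(2550, 1200, 0, 600))
```

Cropping only helps when the block sits at the same position on every bill; if the layout shifts between scans, filtering the OCR text afterwards is likely to be more robust.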

EDIT: Following the comments, the address can indeed be retrieved, here using the logic "the two lines after the line where "address" is mentioned":

text <- ("Nombre y Domicilio
NAME REDACTED 
ADDRESS REDACTED
GALEANA CENTRO LERDO. C.P.
35150 LERDO, DGO.
Cuenta E Tarifa
30DC27B011164660 General < 25kW 02
AE A MA E
Num. de Lectura Lectura Mult. C
Medidor actual anterior
BD6687 40994 40539 1 ¿
Apoyo gubernamental")

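library(dplyr)  # provides the %>% pipe

# Split the OCR output into one element per line
text2 <- strsplit(text, "\n") %>% unlist()
# Index of the line that mentions "address" (case-insensitive)
addressline <- which(grepl("address", text2, ignore.case = TRUE))
# The two lines that follow it
addresslines <- addressline + 1:2
address_extracted <- text2[addresslines]
address_extracted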
[1] "GALEANA CENTRO LERDO. C.P." "35150 LERDO, DGO."
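The snippet above returns only the two lines after the address line, whereas the question asks for everything from the name through the postcode line. A minimal base R sketch of that variant follows. It assumes the block always starts on the line after the "Nombre y Domicilio" header and ends on the first later line that begins with a five-digit postcode. Both are assumptions about the bill layout, and extract_address_block() is a hypothetical helper name.

```r
# Return the lines between the header (exclusive) and the postcode line (inclusive).
extract_address_block <- function(ocr_text,
                                  header = "Nombre y Domicilio",
                                  end_pattern = "^[0-9]{5}[[:space:]]") {
  # One element per line, without surrounding whitespace or empty lines
  lines <- trimws(unlist(strsplit(ocr_text, "\n", fixed = TRUE)))
  lines <- lines[nzchar(lines)]

  # First line after the header; NA if the header was not recognised
  start <- which(grepl(header, lines, ignore.case = TRUE))[1] + 1
  if (is.na(start)) return(character(0))

  # First line at or after 'start' that begins with five digits (assumed postcode)
  ends <- which(grepl(end_pattern, lines))
  end <- ends[ends >= start][1]
  if (is.na(end)) return(character(0))

  lines[start:end]
}

sample_text <- paste("Nombre y Domicilio",
                     "NAME REDACTED ",
                     "ADDRESS REDACTED",
                     "GALEANA CENTRO LERDO. C.P.",
                     "35150 LERDO, DGO.",
                     "Cuenta E Tarifa",
                     sep = "\n")

address <- extract_address_block(sample_text)
print(address)
```

Anchoring on the header and the postcode pattern, rather than on a fixed line count, keeps working if the address spans more or fewer lines, but it returns an empty vector whenever the OCR misreads either anchor.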

Source: https://stackoverflow.com/questions/53185255/select-part-of-text-that-was-extracted-using-the-tesseract-ocr

Tags

ocr

tesseract