Question
I am using Ubuntu.
Here is an image that I got from the internet.
My goal is to extract the data exactly as it is formatted in the image
and dump it into a text file. The layout has to be preserved, with roughly 95-97% accuracy.
I am working with tesseract-ocr.
An almost identical question is here.
My code:
import cv2
import pytesseract
from pytesseract import Output
import numpy as np
img = cv2.imread("/demo.jpg")
d1 = pytesseract.image_to_data(img)
print(d1)
It gives me output that is completely different from what I expect.
In short, I want to convert this image, with its alignment intact, to a text file (or CSV file).
Thanks in advance.
Answer 1:
You can use Tesseract's hOCR output to retain positional information. Converting this kind of document directly into text while preserving the layout is a hard problem. As an intermediate solution, you can build a data frame that holds each word together with its coordinates, then parse that data frame to extract key-value information based on the coordinates.
### this will save the tesseract output as "demo.hocr"
pytesseract.pytesseract.run_tesseract(
"demo.jpg", "demo",
extension='.html', lang='eng', config="hocr")
hOCR is an HTML-like representation that carries metadata such as line information, word information, and the coordinates of each element.
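To make the hOCR structure concrete, here is a minimal sketch that pulls each word and its bounding box out of an hOCR string using only the standard library. It assumes the usual Tesseract layout, where every word is a span with class ocrx_word and a title attribute of the form "bbox x0 y0 x1 y1; x_wconf N". The names HocrWordParser, parse_hocr_words and hocr_words_from_image are made up for this illustration; only pytesseract.image_to_pdf_or_hocr is an actual pytesseract function.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._depth = 0    # spans nested inside the open word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            # Already inside a word: only track nested spans so we close correctly.
            if tag == "span":
                self._depth += 1
            return
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text, *self._bbox))
        self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (text, x0, y0, x1, y1) tuples from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def hocr_words_from_image(image_path):
    """Run Tesseract on an image file and parse the resulting hOCR."""
    import pytesseract  # imported lazily so the parser above works without it

    hocr = pytesseract.image_to_pdf_or_hocr(image_path, extension="hocr")
    return parse_hocr_words(hocr.decode("utf-8"))
```

Calling hocr_words_from_image("demo.jpg") should then give one tuple per recognized word, which you can sort or group by the y coordinates to rebuild rows.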
For easier handling, I wrote a parser that reads this output directly and returns a data frame of words and their coordinates.
I published it as a pip package called tesseract2dict.
You can install it with pip install tesseract2dict
Usage looks like this:
import cv2
from tesseract2dict import TessToDict

td = TessToDict()
inputImage = cv2.imread('path/to/image.jpg')
### function 1
### this returns word-level information as a dataframe
word_dict = td.tess2dict(inputImage, 'outputName', 'outfolder')
### function 2
### this returns the plain text inside a given region, passed as (x, y, w, h)
text_plain = td.word2text(word_dict, (0, 0, inputImage.shape[1], inputImage.shape[0]))
PS: This package is only compatible with Tesseract 5.0.0
Answer 2:
You can use pytesseract's parameters to get what you're looking for.
More specifically, the Output class you imported holds all of the output types that pytesseract supports.
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("/demo.jpg")
# My favorite type is Output.DICT, but since you mentioned CSV, use a DataFrame
# (Output.DATAFRAME requires pandas to be installed)
d1 = pytesseract.image_to_data(img, output_type=Output.DATAFRAME)
print(type(d1))
# One row per detected element, with left/top/width/height, conf and text columns
d1.to_csv('ocr_dump.csv')
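The CSV keeps the coordinates but not the visual alignment. If you want an aligned text file, here is a rough sketch of one way to rebuild it from the Output.DICT result: cluster words into rows by their vertical centre, then pad each word to a column derived from its left coordinate. This is a heuristic I am assuming here, not something pytesseract provides; layout_text, image_to_layout_text and min_conf are names invented for the example, and skewed scans or mixed font sizes will likely need tuning.

```python
from statistics import median


def layout_text(data, min_conf=0.0):
    """Rebuild roughly aligned text from an image_to_data Output.DICT result."""
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        text = str(text).strip()
        # Non-word rows have empty text and conf -1; skip them and weak words.
        if not text or float(conf) < min_conf:
            continue
        words.append({"text": text, "left": int(left), "top": int(top),
                      "width": int(width), "height": int(height)})
    if not words:
        return ""

    # Estimate how many pixels one character occupies, and a typical row height.
    char_w = median(w["width"] / len(w["text"]) for w in words)
    row_h = median(w["height"] for w in words)

    # Cluster words into visual rows by vertical centre.
    words.sort(key=lambda w: w["top"] + w["height"] / 2)
    rows = []
    for w in words:
        centre = w["top"] + w["height"] / 2
        if rows and abs(centre - rows[-1]["centre"]) <= row_h / 2:
            rows[-1]["words"].append(w)
        else:
            rows.append({"centre": centre, "words": [w]})

    lines = []
    for row in rows:
        line = ""
        for w in sorted(row["words"], key=lambda w: w["left"]):
            col = round(w["left"] / char_w)
            # Never overwrite the previous word; keep at least one space.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + w["text"]
        lines.append(line)
    return "\n".join(lines)


def image_to_layout_text(image_path, min_conf=0.0):
    """Run Tesseract on an image file and return the aligned text."""
    import pytesseract  # imported lazily so layout_text works without it
    from pytesseract import Output

    data = pytesseract.image_to_data(image_path, output_type=Output.DICT)
    return layout_text(data, min_conf)
```

You could then write the result of image_to_layout_text("/demo.jpg") to a .txt file and view it in a monospaced font to check the alignment.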
Source: https://stackoverflow.com/questions/62172144/how-to-fetch-info-in-structure-formate-with-tesseract-ocr-in-python