pytesseract and image.tif file

巧了我就是萌 submitted on 2020-01-24 18:09:42

Question


I need to transcribe a multi-page image.tif to text using pytesseract. I have the following code:

> from PIL import Image
> import pytesseract
> pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
> print(pytesseract.image_to_string(Image.open('CAMARA.tif'), lang="spa"))

The problem is that it only extracts the first page. How can I extract all of them?


Answer 1:


I was able to fix the same problem by calling the convert() method, as below:

from PIL import Image
import pytesseract

image = Image.open(imagePath).convert("RGBA")  # imagePath: path to the image file
text = pytesseract.image_to_string(image)
print(text)



Answer 2:


I guess you have referenced only one image, "CAMARA.tif". First you have to convert all the PDF pages into images (the original answer linked to instructions for doing so).

Then use pytesseract to loop over the images one by one and extract the text from each.
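For a multi-page TIFF, the "loop over the images one by one" step might look roughly like the sketch below. It assumes Pillow exposes the pages through n_frames and seek(), and that pytesseract OCRs whichever frame is currently selected; the helper names ocr_all_pages and ocr_tiff are made up for illustration and are not part of either library.

```python
def ocr_all_pages(image, ocr):
    """Run `ocr` on every frame of an already-opened multi-page image."""
    texts = []
    # Single-frame images may lack n_frames, so fall back to one page.
    for index in range(getattr(image, "n_frames", 1)):
        image.seek(index)  # select page `index` of the TIFF
        texts.append(ocr(image))
    return texts


def ocr_tiff(path, lang="spa"):
    """Hypothetical convenience wrapper: one string per page of `path`."""
    # Imported here so ocr_all_pages stays usable without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return ocr_all_pages(
            image, lambda page: pytesseract.image_to_string(page, lang=lang)
        )
```

With this, ocr_tiff('CAMARA.tif') would return a list of strings, one per page, which can then be printed or joined as needed.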




Answer 3:


I just stumbled over the same problem. What you could do is call tesseract directly:

# test.py
import subprocess

in_filename = 'file_0.tiff'
out_filename = 'out'
lang = 'spa'
# tesseract writes the recognized text to out_filename + '.txt'
subprocess.call(['tesseract', in_filename, '-l', lang, out_filename])

This processes all pages:

$ python test.py 
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Page 3
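If you want the text back in Python rather than just on disk, a variation of the script above could read the output file afterwards. This is only a sketch: it assumes the tesseract executable is on the PATH and that plain-text output lands in the output base plus ".txt"; build_tesseract_command and ocr_with_cli are hypothetical helper names, not part of any library.

```python
import subprocess


def build_tesseract_command(in_filename, out_base, lang):
    # Documented CLI order: tesseract <image> <output base> [options]
    return ["tesseract", in_filename, out_base, "-l", lang]


def ocr_with_cli(in_filename, out_base="out", lang="spa"):
    """Run tesseract on every page of `in_filename` and return the text."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(in_filename, out_base, lang), check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    with open(out_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling ocr_with_cli('file_0.tiff') would then return the combined text of all pages as a single string.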


Source: https://stackoverflow.com/questions/45292287/pytesseract-and-image-tif-file
