Convert scanned pdf to text python

泄露秘密 提交于 2019-12-02 21:21:58
ghovat

Take a look at this library: https://pypi.python.org/pypi/pypdfocr but a PDF file can have also images in it. You may be able to analyse the page content streams. Some scanners break up the single scanned page into images, so you won't get the text with ghostscript.

Take a look at my code it is worked for me.

import os
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import gc



pdf=wi(filename=pdf_path,resolution=300)
pdfImg=pdf.convert('jpeg')

imgBlobs=[]
extracted_text=[]

def Get_text_from_image(pdf_path):
    pdf=wi(filename=pdf_path,resolution=300)
    pdfImg=pdf.convert('jpeg')
    imgBlobs=[]
    extracted_text=[]
    for img in pdfImg.sequence:
        page=wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))

    for imgBlob in imgBlobs:
        im=Image.open(io.BytesIO(imgBlob))
        text=pytesseract.image_to_string(im,lang='eng')
        extracted_text.append(text)

    return (extracted_text)

I fixed it for me by editing the /etc/ImageMagick-6/policy.xml and changed the rights for the pdf line to "read|write":

Open the terminal and change the path

cd /etc/ImageMagick-6
nano policy.xml
<policy domain="coder" rights="read" pattern="PDF" /> 
change to
<policy domain="coder" rights="read|write" pattern="PDF" />
exit

When i was extracting the pdf images to text i faced some issues please go through the below link

https://stackoverflow.com/questions/52699608/wand-policy-error- 
error-constitute-c-readimage-412

https://stackoverflow.com/questions/52861946/imagemagick-not- 
authorized-to-convert-pdf-to-an-image

Increasing the memory limit  please go through the below link
enter code here
https://github.com/phw/peek/issues/112
https://github.com/ImageMagick/ImageMagick/issues/396

You can use OpenCV for python. There are a lot of exampels to detect text. Here is the link enter link description here

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!