Convert scanned pdf to text python

爱⌒轻易说出口 提交于 2019-12-31 12:12:29

问题


I have a scanned pdf file and I try to extract text from it. I tried to use pypdfocr to make ocr on it but I have error:

"could not found ghostscript in the usual place"

After searching I found this solution Linking Ghostscript to pypdfocr in Windows Platform and I tried to download GhostScript and put it in environment variable but it still has the same error.

How can I searh text in my scanned pdf file using python?

Thanks.

Edit: here is my code sample:

import os
import sys
import re
import json
import shutil
import glob
from pypdfocr import pypdfocr_gs
from pypdfocr import pypdfocr_tesseract 
from PIL import Image

path = PATH_TO_MY_SCANNED_PDF
mainL = []
kk = {}


def new_init(self, kk):
    self.lang = 'heb'   
    self.binary = "tesseract"
    self.msgs = {
            'TS_MISSING': """ 
                Could not execute %s
                Please make sure you have Tesseract installed correctly
                """ % self.binary,
            'TS_VERSION':'Tesseract version is too old',
            'TS_img_MISSING':'Cannot find specified tiff file',
            'TS_FAILED': 'Tesseract-OCR execution failed!',
        }

pypdfocr_tesseract.PyTesseract.__init__ = new_init  

wow = pypdfocr_gs.PyGs(kk)
tt = pypdfocr_tesseract.PyTesseract(kk)


def secFile(filename,oldfilename):
    wow.make_img_from_pdf(filename)


    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')  
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff") 

    files = glob.glob("PATH" + '*.tiff')  
    for file in files:
        tt.make_hocr_from_pnm(file)
    pdftxt = ""    
    files = glob.glob("PATH" + '*.html') 
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt,oldfilename)

    folder ="PATH"

    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception, e:
            print e

def pdf2ocr(filename):
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)

def ocr2txt(filename):  
    pdffile = filename


    output1 = pdffile.replace(".pdf","_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)

    input1 = pdffile.replace(".pdf","_ocr.pdf")

    os.system("pdf2txt" -o  + output1 + " " + input1) 

    with open(output1) as myfile:
        pdftxt="".join(line.rstrip() for line in myfile)
    findNum(pdftxt,filename)


def findNum(pdftxt,pdffile):
    l = re.findall(r'\b\d+\b', pdftxt)


    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w')
    for i in l:
        output.write(",")
        output.write(i)
    output.close()    

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

i = 0     
files = glob.glob(path + '\\*.pdf') 
print path  
print files 
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            print file
            pdf2ocr(file)    
            ocr2txt(file)
        else:
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print newname
            secFile(newname,file)
        i = i + 1

files = glob.glob(path + '\\' + '*_ocr.pdf')         

for file in files:
    print file
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)

回答1:


Take a look at this library: https://pypi.python.org/pypi/pypdfocr but a PDF file can have also images in it. You may be able to analyse the page content streams. Some scanners break up the single scanned page into images, so you won't get the text with ghostscript.




回答2:


Take a look at my code it is worked for me.

import os
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import gc



pdf=wi(filename=pdf_path,resolution=300)
pdfImg=pdf.convert('jpeg')

imgBlobs=[]
extracted_text=[]

def Get_text_from_image(pdf_path):
    pdf=wi(filename=pdf_path,resolution=300)
    pdfImg=pdf.convert('jpeg')
    imgBlobs=[]
    extracted_text=[]
    for img in pdfImg.sequence:
        page=wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))

    for imgBlob in imgBlobs:
        im=Image.open(io.BytesIO(imgBlob))
        text=pytesseract.image_to_string(im,lang='eng')
        extracted_text.append(text)

    return (extracted_text)

I fixed it for me by editing the /etc/ImageMagick-6/policy.xml and changed the rights for the pdf line to "read|write":

Open the terminal and change the path

cd /etc/ImageMagick-6
nano policy.xml
<policy domain="coder" rights="read" pattern="PDF" /> 
change to
<policy domain="coder" rights="read|write" pattern="PDF" />
exit

When i was extracting the pdf images to text i faced some issues please go through the below link

https://stackoverflow.com/questions/52699608/wand-policy-error- 
error-constitute-c-readimage-412

https://stackoverflow.com/questions/52861946/imagemagick-not- 
authorized-to-convert-pdf-to-an-image

Increasing the memory limit  please go through the below link
enter code here
https://github.com/phw/peek/issues/112
https://github.com/ImageMagick/ImageMagick/issues/396



回答3:


You can use OpenCV for python. There are a lot of examples about detection of text.



来源:https://stackoverflow.com/questions/45480280/convert-scanned-pdf-to-text-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!