Whitespace gone from PDF extraction, and strange word interpretation

爱一瞬间的悲伤 2020-12-01 11:26

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = p         


        
6 Answers
  •  旧时难觅i
    2020-12-01 12:19

    Instead of PyPDF2, use the PDFMiner library, which offers the same functionality, as shown below. I got the code from this and edited it to suit my needs; it gives me a text file with whitespace between the words. I work with Anaconda and Python 3.6. To install PDFMiner for Python 3.6 you can use this link.

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    
    class PdfConverter:
        """Extract the text of a PDF file with pdfminer."""

        def __init__(self, file_path):
            self.file_path = file_path

        def convert_pdf_to_txt(self):
            """Return the PDF text as a string, with spaces between words."""
            rsrcmgr = PDFResourceManager()
            retstr = StringIO()
            codec = 'utf-8'  # 'utf16','utf-8'
            laparams = LAParams()
            device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            password = ""
            maxpages = 0     # 0 means no page limit
            caching = True
            pagenos = set()  # an empty set means all pages
            # 'with' closes the file even if a page fails to parse
            with open(self.file_path, 'rb') as fp:
                for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                              password=password, caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)
            device.close()
            text = retstr.getvalue()
            retstr.close()
            return text

        def save_convert_pdf_to_txt(self):
            """Convert the PDF text to a string and save it as text_pdf.txt."""
            content = self.convert_pdf_to_txt()
            with open('text_pdf.txt', 'wb') as txt_pdf:
                txt_pdf.write(content.encode('utf-8'))


    if __name__ == '__main__':
        pdfConverter = PdfConverter(file_path='sample.pdf')
        print(pdfConverter.convert_pdf_to_txt())
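
    Some background on why the spaces go missing in the first place: a PDF often stores each glyph with a position and no explicit space character, so an extractor has to infer word breaks from the horizontal gaps. PDFMiner's LAParams has a word_margin setting (relative to character width) for this. The snippet below is only a simplified sketch of that idea, not PDFMiner's actual implementation; Glyph and join_glyphs are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A character plus its horizontal extent on the page (hypothetical type)."""
    char: str
    x0: float  # left edge
    x1: float  # right edge


def join_glyphs(glyphs, word_margin=0.1):
    """Join the glyphs of one text line into a string.

    A space is inserted wherever the gap to the previous glyph is wider
    than word_margin times the width of the current glyph.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x0):
        if prev is not None:
            gap = g.x0 - prev.x1
            if gap > word_margin * (g.x1 - g.x0):
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Two words, "ab" and "cd", drawn 3 units apart with no space character stored.
line = [Glyph("a", 0, 5), Glyph("b", 5, 10), Glyph("c", 13, 18), Glyph("d", 18, 23)]
print(join_glyphs(line))                   # ab cd
print(join_glyphs(line, word_margin=1.0))  # abcd
```

    With a margin that is too large for the document, the gap between the words no longer counts as a break and the words run together, which is the symptom described in the question. If the output of the converter above still has merged words, passing a smaller word_margin to LAParams may be worth trying.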
    
