Whitespace gone from PDF extraction, and strange word interpretation

爱一瞬间的悲伤 2020-12-01 11:26

Using the snippet below, I've tried to extract the text from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = p         


        
6 answers
  •  清歌不尽
    2020-12-01 12:00

    As an alternative to PyPDF2, I suggest pdftotext:

    #!/usr/bin/env python
    
    """Use pdftotext to extract text from PDFs."""
    
    import pdftotext
    
    # pdftotext reads raw bytes, so open the file in binary mode
    with open("foobar.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)
    
    # Iterate over all the pages
    for page in pdf:
        print(page)
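
Since the question is about whitespace in the extracted text, here is a minimal sketch of post-processing the page strings that the loop above yields. The helper names (normalize_page, join_pages, whitespace_ratio) are hypothetical and not part of pdftotext; the code only assumes that each page is a plain Python string, as in the loop above.

```python
import re


def normalize_page(text):
    """Collapse runs of spaces and tabs on each line, keeping line breaks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


def join_pages(pages):
    """Join an iterable of page strings, separated by a blank line."""
    return "\n\n".join(normalize_page(page) for page in pages)


def whitespace_ratio(text):
    """Fraction of characters that are whitespace.

    A value close to zero on a page of ordinary prose suggests that the
    extractor dropped the spaces between words.
    """
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


if __name__ == "__main__":
    # Stand-in page strings; in practice these would come from the extractor.
    sample_pages = ["Hello    world\n  second   line  ", "Page two"]
    print(join_pages(sample_pages))
    print(whitespace_ratio("ab cd"))
```

With the pdf object from the answer above, join_pages(pdf) should work the same way, because iterating over it yields one string per page. The whitespace_ratio check is only a rough heuristic for spotting pages where word spacing was lost.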
    
