Whitespace gone from PDF extraction, and strange word interpretation

爱一瞬间的悲伤 2020-12-01 11:26

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = p         


        
6 Answers
  •  时光取名叫无心
    2020-12-01 12:03

    PDFBox is a pretty good tool for extracting text from PDF files using Java. Text extraction is its strength; if you want to modify, annotate, or view PDF files, another tool might serve you better. It includes code for working out where spaces belong in the extracted text.

    It also has code for handling ligatures, but that only works if a particular internationalization library, ICU4J, is on the classpath.

    You could call the PDFBox text extractor from Python as a command-line program, without writing any Java code.
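
    As a rough sketch of that last suggestion, the Python below shells out to the PDFBox command-line app and reads back the text file it writes. It assumes a Java runtime on the PATH and a PDFBox 2.x "pdfbox-app" jar, whose ExtractText tool takes an input PDF and an output text file. The jar location and the file names are placeholders, not taken from the question, and newer PDFBox releases may use a different command syntax, so check the documentation for your version.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, out_path):
    """Build the argument list for the PDFBox 2.x ExtractText tool.

    Assumes the standalone pdfbox-app jar; jar_path is a placeholder
    for wherever that jar lives on your machine.
    """
    return [
        "java", "-jar", str(jar_path),
        "ExtractText",
        "-encoding", "UTF-8",   # ask for UTF-8 output explicitly
        str(pdf_path),
        str(out_path),
    ]


def extract_text(jar_path, pdf_path, out_path):
    """Run PDFBox on pdf_path and return the extracted text."""
    cmd = build_extract_command(jar_path, pdf_path, out_path)
    # check=True raises CalledProcessError if Java or PDFBox fails.
    subprocess.run(cmd, check=True)
    return Path(out_path).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(extract_text("pdfbox-app.jar", "input.pdf", "output.txt"))
```

    Passing the command as a list rather than a single shell string avoids quoting problems with file names that contain spaces.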
