Whitespace gone from PDF extraction, and strange word interpretation

爱一瞬间的悲伤 2020-12-01 11:26

Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = p         


        
6 answers
  •  忘掉有多难
    2020-12-01 12:02

    Your PDF file doesn't contain printable space characters; it simply positions each word where it needs to go. You'll have to do extra work to figure out where the spaces belong, perhaps by assuming that multi-character runs are words and putting spaces between them.
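
One possible reading of that heuristic, as a rough sketch only: assume the extractor hands back a list of separate text runs (the `runs` list and the `join_runs` name are hypothetical, not part of pyPdf), treat every multi-character run as a word, and glue consecutive single-character runs together on the assumption that they are individually positioned letters.

```python
def join_runs(runs):
    """Join extracted text runs into a space-separated string.

    Assumption: each multi-character run is a whole word, and consecutive
    single-character runs are letters of one word positioned individually.
    """
    words = []
    pending = ""  # accumulates consecutive single-character runs
    for run in runs:
        run = run.strip()
        if not run:
            continue  # skip empty or whitespace-only runs
        if len(run) == 1:
            pending += run
        else:
            if pending:
                words.append(pending)
                pending = ""
            words.append(run)
    if pending:
        words.append(pending)
    return " ".join(words)


print(join_runs(["Hello", "world", "P", "D", "F", "text"]))
```

This will misjudge genuine one-letter words such as "a" or "I"; using the runs' positions on the page would be more reliable when that information is available.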

    If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.

    "ﬁ" (U+FB01) is a typographic ligature: the letters "f" and "i" stored and displayed as a single character. You may find the same thing happening with "ﬂ", "ﬃ", and "ﬄ". You can use string replacement to substitute the plain letters "fi" for the ﬁ ligature.
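
The string replacement described above could look something like this minimal sketch. The `LIGATURES` table and the `expand_ligatures` name are illustrative; the code points come from the Unicode Alphabetic Presentation Forms block.

```python
# Map Unicode ligature code points to their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each ligature character in text with its plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


print(expand_ligatures("\ufb01le o\ufb03ce"))
```

`unicodedata.normalize("NFKC", text)` from the standard library also expands these ligatures, but it rewrites other compatibility characters as well, so the explicit table is the more conservative choice.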
