Using the snippet below, I've attempted to extract the text data from this PDF file.
import pyPdf
def get_text(path):
    # Load PDF into pyPDF
    pdf = p
Your PDF file doesn't contain printable space characters; it simply positions each word where it needs to go. You'll have to do extra work to figure out where the spaces belong, perhaps by assuming that multi-character runs are words and putting spaces between them.
If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.
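The run-joining heuristic can be sketched as follows. This is a minimal illustration in Python 3, not something pyPdf provides: it assumes you have already obtained the text as a list of separate runs in reading order, and the function name `join_runs` is hypothetical. It treats every multi-character run as a word and puts a space in front of it, while single-character runs such as punctuation are attached to the preceding text.

```python
def join_runs(runs):
    """Join text runs, putting a space before each multi-character run.

    `runs` is assumed to be a list of strings in reading order
    (a hypothetical input; how you obtain it depends on your PDF library).
    """
    parts = []
    for run in runs:
        if not run:
            continue  # ignore empty runs
        # Heuristic: a multi-character run is a word, so separate it from
        # the preceding text unless that text already ends in whitespace.
        if len(run) > 1 and parts and not parts[-1][-1].isspace():
            parts.append(" ")
        parts.append(run)
    return "".join(parts)


print(join_runs(["Your", "PDF", "file"]))
```

The heuristic is crude: one-letter words such as "a" or "I" get glued to the preceding word, and an opening bracket ends up attached to the word before it rather than the one after it, so expect to tune it for your document.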
"ﬁ" is a typographic ligature: the letters "f" and "i" stored and shown as a single character. You may find the same thing happening with "fl", "ffi", and "ffl". You can use string replacement to substitute the two letters "fi" for the ﬁ ligature.
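As a rough sketch of that string replacement in Python 3, the mapping below covers the common Latin f-ligatures from Unicode's Alphabetic Presentation Forms block. The names `LIGATURES` and `expand_ligatures` are made up for this example, and it assumes the extracted text is already a Unicode string.

```python
# Common f-ligatures and their plain-letter equivalents.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each ligature character in `text` with its plain letters."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text


print(expand_ligatures("\ufb01nd the \ufb02oor"))
```

`unicodedata.normalize("NFKC", text)` from the standard library also expands these ligatures, but it rewrites other compatibility characters as well, so the explicit table is the safer choice if you only want to touch ligatures.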