Using the snippet below, I've attempted to extract the text data from this PDF file.
import pyPdf
def get_text(path):
    # Load PDF into pyPDF
    pdf = p
PDFBox is a pretty good tool for extracting text from PDF files using Java. Text extraction is its strength; if you want to modify, annotate, or view PDF files, another tool might serve you better. It has code for detecting where spaces belong in the extracted text.
It also has code for handling ligatures, but that only works if a certain internationalization library, ICU4J, is on the classpath.
You could call the PDFBox text extractor from Python as a command-line program, without writing any Java code.
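A minimal sketch of that approach, assuming Java is on your PATH and you have downloaded the standalone PDFBox app jar. The jar path and the helper names `build_extract_command` and `get_text` are placeholders I made up. The `ExtractText -console` syntax is what the PDFBox 2.x app jar uses; I believe 3.x switched to an `export:text` subcommand, so check the usage message of your version.

```python
import subprocess


def build_extract_command(jar_path, pdf_path, java="java"):
    """Build the argument list for PDFBox's command-line text extractor.

    Assumes the PDFBox 2.x app jar, where the tool is named ExtractText
    and -console sends the text to stdout instead of a file.
    """
    return [
        java, "-jar", jar_path,
        "ExtractText",
        "-console",            # write extracted text to stdout
        "-encoding", "UTF-8",  # ask for UTF-8 output so we can decode it
        pdf_path,
    ]


def get_text(jar_path, pdf_path):
    """Run PDFBox as a subprocess and return the extracted text."""
    # check_output raises CalledProcessError if java exits with an error.
    raw = subprocess.check_output(build_extract_command(jar_path, pdf_path))
    return raw.decode("utf-8")
```

You would then call something like `get_text("pdfbox-app.jar", "input.pdf")`, with both paths swapped for your own. If you need the ligature handling, the ICU4J jar has to be on the classpath too, so you may need `java -cp` with both jars rather than `-jar`.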