Complementing @Sarah's answer: PDFMiner is a good choice. I have been using it for quite some time, and so far it has worked well for extracting the text content of a PDF. I wrote a function that calls the command-line client that ships with pdfminer (pdf2txt.py) and returns its output as a string, which I can then use elsewhere. I am using Python 3.6 and the function does the job, so maybe this can work for you:
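    import subprocess

    def pdf_to_text(filepath):
        print('Getting text content for {}...'.format(filepath))
        # Capture stderr on its own pipe; with stderr=subprocess.STDOUT the
        # stderr value returned by communicate() is always None.
        process = subprocess.Popen(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()
        # Treat a non-zero exit code or anything written to stderr as a failure
        if process.returncode != 0 or stderr:
            raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}\nError: {}'.format(filepath, process.returncode, stdout, stderr))
        return stdout.decode('utf-8')

The only import needed is subprocess, which is part of the standard library. pdf2txt.py itself has to be on your PATH for the call to work.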
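If you prefer subprocess.run (available since Python 3.5) over Popen, the same idea could look roughly like the sketch below. The helper names run_and_capture and pdf_to_text_run are made up for illustration, and unlike the function above this version only raises on a non-zero exit code, so warnings written to stderr do not abort the extraction. It still assumes pdf2txt.py is on your PATH.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command (list of arguments) and return its stdout as text.

    Hypothetical helper: raises OSError if the command exits with a
    non-zero return code.
    """
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise OSError('Command {} failed with code {}:\n{}'.format(
            command, result.returncode,
            result.stderr.decode('utf-8', errors='replace')))
    return result.stdout.decode('utf-8')


def pdf_to_text_run(filepath):
    # Assumption: pdf2txt.py (installed with pdfminer) is on the PATH
    return run_and_capture(['pdf2txt.py', filepath])


if __name__ == '__main__':
    # Demonstrate the helper with the Python interpreter itself,
    # so the example runs even without pdfminer installed.
    print(run_and_capture([sys.executable, '-c', 'print("hello")']).strip())
```

With pdfminer installed, you would call it as text = pdf_to_text_run('some_file.pdf'), where the file name is just a placeholder.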