I\'m trying to extract the text included in this PDF file using Python
.
I\'m using the PyPDF2 module, and have the following script:
imp
I've try many Python PDF converters, and I like to update this review. Tika is one of the best. But PyMuPDF is a good news from @ehsaneha user.
I did a code to compare them in: https://github.com/erfelipe/PDFtextExtraction I hope to help you.
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
from tika import parser
raw = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
raw = str(raw)
safe_text = raw.encode('utf-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )