I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in them. I tried using Tika as a JAR with Python and using it with jnius.
Install tika with the following pip command:
pip install tika
The following code works fine for extracting data:
from tika import parser

def extract_text(file):
    # parser.from_file returns a dict; the extracted text is under 'content'
    parsed = parser.from_file(file)
    # 'content' may be None if no text could be extracted
    parsed_text = parsed['content'] or ''
    return parsed_text.lower()

file_name_with_extension = input("Enter File Name:")
text = extract_text(file_name_with_extension)
print(text)
It prints only the text content of the file. Supported file formats are listed in the Apache Tika documentation.
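If you also want the file's metadata and not just its content, here is a minimal sketch. It assumes the dict returned by parser.from_file has 'content' and 'metadata' keys, which is what I see with the tika package. The helper names split_parsed and extract_text_and_metadata are my own, not part of tika.

```python
def split_parsed(parsed):
    """Return (text, metadata) from a dict shaped like tika's parser output."""
    # 'content' may be None, for example when a PDF has no text layer.
    text = (parsed.get("content") or "").strip()
    # 'metadata' is assumed to be a dict of document properties.
    metadata = parsed.get("metadata") or {}
    return text, metadata


def extract_text_and_metadata(path):
    # Imported here so split_parsed can be used without tika installed.
    from tika import parser
    return split_parsed(parser.from_file(path))
```

Call extract_text_and_metadata with the path to your PDF, then print the text and look through the metadata dict for whatever fields your files carry. Keeping the dict handling in its own function means you can check it against a hand-written dict without starting the Tika server.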