How can I use Tika package(https://github.com/chrismattmann/tika-python) in python(2.7) to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in the files. I tried using Tika as a JAR with Python and using it with the jnius

5 Answers

清酒与你 (OP)

2021-01-01 08:33

Install tika with the following pip command:

pip install tika

The following code works fine for extracting data:

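from tika import parser

try:
    read_line = raw_input  # Python 2.7: input() would eval() what the user types
except NameError:
    read_line = input  # Python 3

def extract_text(file):
    # from_file() returns a dict; the extracted text is under 'content'
    parsed = parser.from_file(file)
    # 'content' can be None when Tika extracts no text from the file
    parsed_text = parsed['content'] or ''
    return parsed_text.lower()

file_name_with_extension = read_line("Enter File Name:")
text = extract_text(file_name_with_extension)
print(text)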

This prints only the text content of the file, converted to lowercase. Supported file formats are listed in the Apache Tika documentation.
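Text pulled out of drawings often comes back with many blank lines and uneven spacing, so a small cleanup step may help. Here is a rough sketch, assuming the result is the usual dict with a 'content' key. The helper names normalize_content and extract_clean_text are my own for illustration and are not part of tika:

```python
import re

def normalize_content(parsed):
    """Return lowercased text with blank lines and repeated spaces removed."""
    # 'content' may be missing or None if nothing could be extracted
    content = parsed.get('content') or ''
    # Collapse runs of whitespace inside each line and trim the ends
    lines = [re.sub(r'\s+', ' ', line).strip() for line in content.splitlines()]
    # Drop the lines that ended up empty
    return '\n'.join(line for line in lines if line).lower()

def extract_clean_text(path):
    # Imported here so the cleanup helper can be used without tika installed;
    # this call needs `pip install tika` and a Java runtime on the machine
    from tika import parser
    return normalize_content(parser.from_file(path))
```

Call extract_clean_text with the path to your PDF to get the cleaned text; normalize_content can also be applied to a result you have already parsed.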
