How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in the files. I tried using Tika as a jar with Python and using it with the jnius

5 Answers
  •  清酒与你
    2021-01-01 08:33

    Install tika with the following pip command:

    pip install tika
    

    The following code works fine for extracting data:

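    from tika import parser

    try:
        read_line = raw_input  # Python 2.7: input() would eval() the typed text
    except NameError:
        read_line = input      # Python 3

    def extract_text(path):
        # from_file returns a dict; the extracted text is under 'content'
        parsed = parser.from_file(path)
        # 'content' can be None when Tika finds no text in the file
        parsed_text = parsed.get('content') or ''
        return parsed_text.lower()

    file_name_with_extension = read_line("Enter File Name:")
    text = extract_text(file_name_with_extension)
    print(text)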
    

    This prints only the text content of the file (lowercased), not its metadata. Supported file formats are listed in the Apache Tika documentation.
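If the metadata is also of interest, or the raw text from a drawing contains many blank lines, a small helper along these lines may be useful. This is only a sketch: `summarize_parsed` and `parse_pdf` are hypothetical names, not part of tika-python, and it assumes the result dict has the `metadata` and `content` keys that `parser.from_file` normally returns.

```python
def summarize_parsed(parsed):
    """Return (metadata dict, list of non-empty text lines) from a Tika result dict."""
    # Either key may be missing or None, e.g. for a scanned PDF with no text layer.
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""
    # Strip surrounding whitespace and drop lines that are empty afterwards.
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return metadata, lines


def parse_pdf(path):
    """Hypothetical wrapper: parse a file with tika-python and summarize the result."""
    # Imported here so summarize_parsed can be used and tested without tika installed.
    from tika import parser
    return summarize_parsed(parser.from_file(path))
```

Calling `parse_pdf("drawing.pdf")` (the file name is a placeholder) would return the metadata dict and the cleaned lines. Which metadata keys are present depends on the file and on the Tika version.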
