How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings in order to extract their text data. I tried using Tika as a jar with Python and using it with the jnius

5 answers
  •  暗喜
    暗喜 (original poster)
    2021-01-01 08:12

    Can you please share the file you are looking at? The easiest way would probably be to attach it to a GitHub issue in my repository.

    That said, if you are trying to use OCR with Tika, you need to work through the Tika OCR guide (http://wiki.apache.org/tika/TikaOCR) and get Tesseract installed. Once Tesseract is installed, double-check whether an instance of tika-server is already running (e.g., ps aux | grep tika). If one is, kill it. tika-python runs the Tika REST server in the background as its main interface to Tika, so starting a fresh server after Tesseract OCR is installed helps rule out any odd behavior.
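If you would rather do the "check for a running tika-server and kill it" step from Python instead of by hand, here is a minimal sketch. It assumes a Unix-like system where `ps aux` prints the PID in the second column; the helper names `find_tika_pids` and `kill_stale_tika_servers` are made up for illustration and are not part of tika-python.

```python
import os
import signal
import subprocess


def find_tika_pids(ps_output):
    """Return the PIDs of `ps aux` lines that mention tika-server."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # In `ps aux` output the second column is the PID; the header
        # line is skipped because "PID" is not a number.
        if "tika-server" in line and len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


def kill_stale_tika_servers():
    """Send SIGTERM to every tika-server process reported by `ps aux`."""
    ps_output = subprocess.check_output(["ps", "aux"]).decode("utf-8", "replace")
    for pid in find_tika_pids(ps_output):
        os.kill(pid, signal.SIGTERM)
```

Calling `kill_stale_tika_servers()` once, before the first `from tika import parser` call in your script, should leave tika-python free to start a fresh server. Note that it matches on the substring "tika-server" only, so it will miss a server whose command line does not contain that text.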

    Once Tesseract OCR is installed and no tika-server is running, start your Python 2.7 interpreter (or script) and do something like:

    from tika import parser

    # tika-python talks to the Tika REST server running in the background
    parsed = parser.from_file('/path/to/file')
    print(parsed["content"])  # should be the text returned from OCR
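One thing to watch for with scanned drawings: as far as I know, `parsed["content"]` can come back as `None` when Tika finds no text (for example if OCR did not actually run), and printing or slicing it then fails in confusing ways. A small guard like the sketch below may help. The names `extract_text` and `parse_pdf` are hypothetical helpers, not tika-python functions.

```python
def extract_text(parsed):
    """Return the stripped text from a tika-python result dict, or ''."""
    # "content" may be None when no text was extracted from the file.
    content = parsed.get("content")
    return content.strip() if content else ""


def parse_pdf(path):
    """Parse a file with tika-python and return its text ('' if none)."""
    # Imported here so extract_text can be used on its own.
    from tika import parser
    return extract_text(parser.from_file(path))
```

An empty string from `parse_pdf('/path/to/file')` is then a hint that OCR is not wired up yet, rather than a crash.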
    

    HTH! --Chris
