How can I use Tika package(https://github.com/chrismattmann/tika-python) in python(2.7) to parse PDF files?

前端 未结 5 1341
南旧
南旧 2021-01-01 07:48

I\'m trying to parse a few PDF files that contain engineering drawings to obtain text data in the files. I tried using TIKA as a jar with python and using it with the jnius

5条回答
  •  情话喂你
    2021-01-01 08:23

    You need to download the Tika Server Jar and run it first. Check this link: http://wiki.apache.org/tika/TikaJAXRS

    1. Download the Jar
    2. Store it somewhere and run it as java -jar tika-server-x.x.jar --port xxxx
    3. In your Code you now don't need to do the tika.initVM() Add tika.TikaClientOnly = True instead of tika.initVM()
    4. Change parsed = parser.from_file('/path/to/file') to parsed = parser.from_file('/path/to/file', '/path/to/server') You will get the server path in Step 2. when the tika server initiates - just plug that in here

    Good luck!

提交回复
热议问题