Python: how to use Tika with an existing jar file without downloading it again

旧城冷巷雨未停 posted on 2020-03-18 12:44:35

Question


I'm using Tika, and I noticed that the jar file is downloaded again and placed in the Temp folder every time:

Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.

The problem is that the jar file is around 60 MB, so each download takes some time.

This is the code I'm using:

from tika import parser

def get_pdf_text(path):
    # from_file returns a dict; the extracted text is under 'content'
    parsed = parser.from_file(path)
    return parsed['content']

The only workaround I found is this:

1 - Manually running the jar using java -jar tika-server-x.x.jar --port xxxx

2 - Using tika.TikaClientOnly = True

3 - Replacing parser.from_file(path) with parser.from_file(path, '/path/to/server')

But I don't want to run the jar file manually. It would be better if Python could start the jar file automatically and set up Tika with it, without downloading it again.
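
For reference, the three workaround steps above could be driven from Python rather than by hand. This is only a rough sketch under several assumptions: the helper names are hypothetical, 9998 is taken as the server port, the endpoint is written as an http://localhost URL, and client-only mode is switched on through the TIKA_CLIENT_ONLY environment variable (the environment-variable counterpart of tika.TikaClientOnly) before tika is imported.

```python
import os
import socket
import subprocess
import time


def build_server_command(jar_path, port):
    """Command line from step 1: java -jar <jar> --port <port>."""
    return ["java", "-jar", str(jar_path), "--port", str(port)]


def wait_for_port(port, host="localhost", timeout=30.0):
    """Poll until something accepts TCP connections on host:port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False


def start_tika_server(jar_path, port=9998):
    """Start the existing jar in the background and return the process."""
    proc = subprocess.Popen(build_server_command(jar_path, port))
    if not wait_for_port(port):
        proc.terminate()
        raise RuntimeError("Tika server did not start listening in time")
    return proc


def get_pdf_text(path, port=9998):
    # Assumption: TIKA_CLIENT_ONLY must be set before tika is imported,
    # so that tika neither downloads nor starts a server of its own.
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    from tika import parser
    parsed = parser.from_file(path, "http://localhost:%d" % port)
    return parsed["content"]
```

A caller would run start_tika_server('/path/to/tika-server-x.x.jar') once, call get_pdf_text(path) as often as needed, and finally call terminate() on the returned process.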


Answer 1:


To resolve this problem, set an environment variable for the Tika server jar that specifies the path of the folder containing the jar file.

TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'.
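
A minimal sketch of setting this variable from Python, before tika is imported. Note one difference from the line above: as far as the tika-python README describes it, TIKA_SERVER_JAR holds a URL to the jar itself, so this sketch builds a file:// URL from a local jar path rather than passing a folder. Which form works may depend on the tika-python version. The helper name configure_tika_jar and the example path are hypothetical.

```python
import os
from pathlib import Path


def configure_tika_jar(jar_path):
    """Point tika-python at a local jar; call this before importing tika."""
    jar = Path(jar_path).resolve()
    # Assumption: TIKA_SERVER_JAR takes a URL, so a local file becomes file://...
    os.environ["TIKA_SERVER_JAR"] = jar.as_uri()
    return os.environ["TIKA_SERVER_JAR"]


def get_pdf_text(path):
    # Imported lazily so that the environment variable is already set
    # when tika reads its configuration at import time.
    from tika import parser
    parsed = parser.from_file(path)
    return parsed["content"]
```

Typical use would be configure_tika_jar('/path/to/tika-server.jar') at program start, followed by get_pdf_text(path). The README's offline setup reportedly also expects the matching tika-server.jar.md5 file to sit next to the jar.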



Source: https://stackoverflow.com/questions/56559850/python-how-to-use-tika-with-existing-jar-file-without-downloading-again
