tika-server

Apache Tika Server - Request Header Parameters?

馋奶兔 submitted on 2021-02-08 06:51:02
Question: The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, e.g.:

    $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

Across many different documents about Tika, I found these additional header parameters documented:

    X-Tika-OCRLanguage: eng
    X-Tika-PDFextractInlineImages: true | false
    X-Tika-PDFOcrStrategy: ocr_only | ocr_and_text_extraction
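For reference, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_tika_request` and `extract_text` are hypothetical, and it assumes a Tika Server is listening at http://localhost:9998/tika as in the curl call. Like `curl -T`, it uploads the document with an HTTP PUT.

```python
import urllib.request

# Endpoint taken from the curl example; adjust if the server runs elsewhere.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, ocr_strategy="ocr_only", url=TIKA_URL):
    """Build (but do not send) a PUT request carrying the OCR strategy header.

    Hypothetical helper: the header name and values come from the question,
    everything else is an illustrative assumption.
    """
    headers = {"X-Tika-PDFOcrStrategy": ocr_strategy}
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


def extract_text(path, ocr_strategy="ocr_only"):
    """Upload the file at `path` and return the response body as text.

    Requires a running Tika Server; assumes the response is UTF-8 encoded.
    """
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), ocr_strategy)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract_text("test/Dokument01.pdf")` would correspond to the curl command shown in the question. Building the request separately from sending it makes it easy to inspect which headers are actually set.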

Define a MIME type for .TXT files for Tika

浪尽此生 submitted on 2019-12-13 20:15:21
Question: I want to define a MIME type for *.txt files, text/txt, so that Tika can apply a more specific parser than the one used for text/plain files. The glob *.txt is included in the definition of the type text/plain in tika-mimetypes.xml. Moreover, it seems to me that you cannot redefine a MIME type in custom-mimetypes.xml, only add new globs or magic patterns. Additionally, if I define the text/txt type in tika-mimetypes.xml as a subtype of text/plain with only the glob *.txt, Tika still

422 Tika server response? Tika-Python

一曲冷凌霜 submitted on 2019-12-11 19:47:49
Question: I have been trying to get Apache Tika to work with this Python package: https://github.com/chrismattmann/tika-python

I have the following code in my Python program:

    #!/usr/bin/env python
    import tika
    tika.initVM()
    from tika import parser
    parsed = parser.from_file('pdf/myPdf.pdf')

But I get a 422 response every time:

    [MainThread ] [WARNI] Failed to see startup log message; retrying...
    [MainThread ] [WARNI] Tika server returned status: 422

Apache Tika does work when I use the following command: