Apache Tika Server - Request Header Parameters?

Submitted by 馋奶兔 on 2021-02-08 06:51:02

Question


The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, for example:

$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

Across various documents about Tika, I found the following additional header parameters:

X-Tika-OCRLanguage: eng
X-Tika-PDFextractInlineImages: true | false
X-Tika-PDFOcrStrategy: ocr_only  |  ocr_and_text_extraction
X-Tika-OCRoutputType: hocr

But there seems to be no documentation on how to use the X-Tika-..... header parameters, or on which parameters are supported and which are not.

For example, I wonder whether it is possible to override the ImageType or the DPI with something like:

X-Tika-PDFocrImageType: rgb
X-Tika-PDFocrDPI: 100

My question is: which header parameters are supported, and which naming convention do these parameters follow?


Answer 1:


The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.

Those header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.
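To make the reflection step concrete, here is a minimal Python sketch of the idea, not Tika's actual Java code. The class PdfConfigStandIn and the function apply_header are hypothetical stand-ins invented for illustration. Capitalizing the first letter of the suffix before looking up the setter is an assumption, and so is the crude true/false conversion of the value.

```python
class PdfConfigStandIn:
    """Hypothetical stand-in for a config object such as PDFParserConfig."""

    def __init__(self):
        self.extract_inline_images = False
        self.ocr_strategy = None

    # Java-style setter names are kept on purpose to mirror the config classes.
    def setExtractInlineImages(self, value):
        self.extract_inline_images = value

    def setOcrStrategy(self, value):
        self.ocr_strategy = value


def apply_header(config, prefix, header_name, header_value):
    """Map one header onto a setter of `config`; return True if it was applied."""
    if not header_name.startswith(prefix):
        return False  # header belongs to another config object (or none)
    suffix = header_name[len(prefix):]
    if not suffix:
        return False
    # Assumption: the first letter of the suffix is capitalized to get the setter name.
    setter_name = "set" + suffix[0].upper() + suffix[1:]
    setter = getattr(config, setter_name, None)  # reflection-style lookup
    if setter is None:
        raise ValueError("no setter %s for header %s" % (setter_name, header_name))
    # Assumption: a crude conversion; a real server would inspect the setter's parameter type.
    if header_value.lower() in ("true", "false"):
        setter(header_value.lower() == "true")
    else:
        setter(header_value)
    return True


config = PdfConfigStandIn()
apply_header(config, "X-Tika-PDF", "X-Tika-PDFextractInlineImages", "true")
apply_header(config, "X-Tika-PDF", "X-Tika-PDFOcrStrategy", "ocr_only")
print(config.extract_inline_images, config.ocr_strategy)
```

Under these assumptions, a header whose suffix does not match any setter raises an error, which is one way a reflection-based mapping could surface unsupported options.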

So, to see which X-Tika headers you can set, look up the options on the config class you want to tweak (TesseractOCRConfig for OCR, PDFParserConfig for PDF), build the header name from the option name, and set the header. If you are not sure what an option does or which values it takes, look at the JavaDocs for the underlying setter method that will be called.

For example, setExtractInlineImages on PDFParserConfig maps to X-Tika-PDFextractInlineImages.
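The naming rule implied by that example can be sketched in a few lines of Python. The helper header_for_setter is hypothetical. Lowercasing the first letter of the option name follows the single example given here; the headers quoted in the question (such as X-Tika-PDFOcrStrategy) start the suffix with an uppercase letter, so whether the case of that letter matters is not settled by this answer.

```python
def header_for_setter(prefix, setter_name):
    """Build a header name from a config setter name (hypothetical helper)."""
    if not setter_name.startswith("set") or len(setter_name) <= 3:
        raise ValueError("expected a setter name like setExtractInlineImages")
    option = setter_name[3:]  # drop the leading "set"
    # Assumption: lowercase the first letter, as in X-Tika-PDFextractInlineImages.
    return prefix + option[0].lower() + option[1:]


PDF_PREFIX = "X-Tika-PDF"  # prefix for PDFParserConfig options, as used in the question

headers = {header_for_setter(PDF_PREFIX, "setExtractInlineImages"): "true"}
print(headers)  # {'X-Tika-PDFextractInlineImages': 'true'}
```

Each resulting name/value pair can then be passed to curl with --header, in the same way as the X-Tika-PDFOcrStrategy example in the question.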



Source: https://stackoverflow.com/questions/62011038/apache-tika-server-request-header-parameters
