apache-tika

Apache Tika Server - Request Header Parameters?

馋奶兔 posted on 2021-02-08 06:51:02
Question: The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, e.g.:

    $ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

From many different documents about Tika, I found these additional documented header parameters:

    X-Tika-OCRLanguage: eng
    X-Tika-PDFextractInlineImages: true | false
    X-Tika-PDFOcrStrategy: ocr_only | ocr_and_text_extraction
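
For readers who prefer a script to curl, the request in the question can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_tika_request` and `extract_text` are hypothetical, the URL and the header value are copied from the curl example, and the `Content-Type` of `application/pdf` is an assumption based on the sample file being a PDF.

```python
import urllib.request

# Endpoint copied from the curl example in the question.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document: bytes,
                       ocr_strategy: str = "ocr_only",
                       url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request mirroring the curl call."""
    headers = {
        # Header name and value taken from the question.
        "X-Tika-PDFOcrStrategy": ocr_strategy,
        # Assumption: the document is a PDF. Without an explicit value,
        # urllib would send a form-encoded Content-Type for the body.
        "Content-Type": "application/pdf",
    }
    return urllib.request.Request(url, data=document, headers=headers, method="PUT")


def extract_text(path: str) -> str:
    """Send the file to a running Tika Server and return the response body."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("test/Dokument01.pdf")` requires a Tika Server listening on port 9998. Note that urllib normalizes the capitalization of header names; HTTP header names are case-insensitive, so this does not change the request's meaning.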

TikaException: Failed to close temporary resource - how to fix?

↘锁芯ラ posted on 2021-01-29 07:50:40
Question: I am using Apache Tika on Windows 10 with JRE 1.8.0_181, and I've imported Tika using Maven with the following dependencies:

    <dependencies>
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.21</version>
      </dependency>
    </dependencies>

I have the code below for performing OCR using Tesseract (which I have independently tested

Why does the Tika facade choose EmptyParser?

纵然是瞬间 posted on 2021-01-27 19:10:39
Question: I'm using the Tika facade, per the example of the elasticsearch-mapper-attachment plugin. Here's my test code:

    Tika tika = new Tika();
    Metadata md = new Metadata();
    try {
        String content = tika.parseToString(src, md, 100000);
        System.out.println("Content length: " + content.length());
        for (String s : md.names()) {
            System.out.println(s + ": " + md.get(s));
        }
    } catch (TikaException e) {
        System.out.println(e);
    }

Here's the output:

    Content length: 0
    X-Parsed-By: org.apache.tika.parser.EmptyParser

Apache Solr - Indexing ZIP files

空扰寡人 posted on 2021-01-07 06:59:24
Question: My web app is an e-mail service. It stores email messages in a MySQL database, and email attachments are on disk. The database is similar to:

    ----------------------------------------------------------------------
    | id | sender | receiver | subject | body | attach_dir  | attachments |
    ----------------------------------------------------------------------
    | 2  | 444    | 555      | Apples  | Hey! | /mnt/emails | att1.doc\r\n|
    |    |        |          |         |      |             | att2.doc\r\n|
    ------------------------------------------------------

Apache Solr - Indexing ZIP files

徘徊边缘 posted on 2021-01-07 06:59:07
Question: My web app is an e-mail service. It stores email messages in a MySQL database, and email attachments are on disk. The database is similar to:

    ----------------------------------------------------------------------
    | id | sender | receiver | subject | body | attach_dir  | attachments |
    ----------------------------------------------------------------------
    | 2  | 444    | 555      | Apples  | Hey! | /mnt/emails | att1.doc\r\n|
    |    |        |          |         |      |             | att2.doc\r\n|
    ------------------------------------------------------

Solr ExtractingRequestHandler extracting “rect” in links

若如初见. posted on 2020-12-29 13:25:07
Question: I am using Solr's ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted where it does not exist in the HTML source. I have my solrconfig cell configuration as follows:

    <requestHandler name="/upate/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <!-- capture link hrefs but ignore div

Solr ExtractingRequestHandler extracting “rect” in links

依然范特西╮ posted on 2020-12-29 13:23:47
Question: I am using Solr's ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces. The extracted content returned has "rect" inserted where it does not exist in the HTML source. I have my solrconfig cell configuration as follows:

    <requestHandler name="/upate/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <!-- capture link hrefs but ignore div

Configure Tesseract with solr 6.4.1

ぃ、小莉子 posted on 2020-06-28 06:30:18
Question: How do I configure Tika OCR with Solr 6.4.1? I indexed documents including PDFs, images, and MS Office documents, but Tika did not extract text from images, nor from images embedded in PDF and MS Office documents. From my research, Tika OCR is used for this. For this purpose I am installing tika-app-1.7.jar and Tesseract, but I don't know how to configure them with my Solr core.

Answer 1: You don't need to do anything special. Simply get the Tesseract OCR setup for your
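
The answer is cut off, so the exact setup it recommends is unknown. As a rough sanity check, and assuming the OCR integration relies on the `tesseract` executable being reachable on the system PATH (an assumption, not something the excerpt confirms), the following sketch reports whether the binary can be found and which version it announces. The helper names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Return the first line printed by `tesseract --version`, or None if unavailable."""
    path = find_tesseract(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"],
                            capture_output=True, text=True, check=False)
    # Depending on the Tesseract release, the banner goes to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None
```

If `find_tesseract()` returns `None` when run as the user that starts Solr, the binary is not visible to that process, which would be a plausible first thing to rule out.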

python how to use tika with existing jar file without downloading again

旧城冷巷雨未停 posted on 2020-03-18 12:44:35
Question: I'm using Tika and I realized that the jar file is downloaded each time and placed in the Temp folder:

    Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
    Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.

The problem is that the jar file size is around 60MB, which
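
One way to avoid the download, sketched here as an alternative rather than as the accepted fix, is to start the existing jar yourself (for example with `java -jar tika-server-1.19.jar`) and tell the Python client to talk to that server only. The sketch assumes the tika-python package honors the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables and that they are read when the package is first imported; the helper names and the default endpoint are hypothetical.

```python
import os


def client_only_env(endpoint: str = "http://localhost:9998") -> dict:
    """Environment settings that ask tika-python to use an already running server."""
    return {
        # Assumption: these variable names are the ones tika-python reads.
        "TIKA_CLIENT_ONLY": "True",
        "TIKA_SERVER_ENDPOINT": endpoint,
    }


def parse_with_running_server(path: str, endpoint: str = "http://localhost:9998"):
    """Parse a file through a Tika Server that was started manually."""
    os.environ.update(client_only_env(endpoint))
    # Imported only after the environment is set, so the settings are picked up.
    from tika import parser
    return parser.from_file(path)
```

With the server already running, `parse_with_running_server("some.pdf")` should not need to fetch the jar, because the client never tries to launch a server of its own.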